Searching for phrases with similar meaning using nltk

8 votes
2 answers
4455 views
Asked 2025-04-17 22:39

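I have a bunch of unrelated paragraphs, and I need to traverse them to find similar occurrences. For example, when I search for "object falls", I want a boolean True returned for text containing:

  • Box fell from shelf
  • Bulb shattered on the ground
  • A piece of plaster fell from the ceiling

And False for:

  • The blame fell on Sarah
  • The temperature fell abruptly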

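I am able to use nltk to tokenise, tag and get Wordnet synsets, but I am finding it hard to work out how to fit nltk's moving parts together to achieve the desired result. Should I chunk before looking up synsets? Do I need to write a context-free grammar? Is there a best practice for translating treebank tags into Wordnet grammar tags? None of this is explained in the nltk book, and I could not find it in the nltk cookbook either.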

Bonus points for answers that include pandas.


[ EDIT ]:

Some code to get things started

In [1]:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series

def tag(x):
    return pos_tag(word_tokenize(x))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly']

ser = Series(phrases)
ser.map(tag)

Out[1]:

0    [(Box, NNP), (fell, VBD), (from, IN), (shelf, ...
1    [(Bulb, NNP), (shattered, VBD), (on, IN), (the...
2    [(A, DT), (piece, NN), (of, IN), (plaster, NN)...
3    [(The, DT), (blame, NN), (fell, VBD), (on, IN)...
4    [(Berlin, NNP), (fell, VBD), (on, IN), (May, N...
5    [(The, DT), (temperature, NN), (fell, VBD), (a...
dtype: object

2 Answers

0 votes

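It's not perfect, but most of the work is done. What remains is to hardcode pronouns (such as "it") and closed-class words, and to add multiple targets to handle cases like "shatter". It's not a one-liner, but it's not an impossible task either!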

from collections.abc import Iterable

from nltk.corpus import wordnet as wn
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series, DataFrame

def tag(x):
    return pos_tag(word_tokenize(x))

def flatten(l):
    # Recursively flatten nested iterables; strings are treated as atoms.
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, str):
            for sub in flatten(el):
                yield sub
        else:
            yield el

def noun_verb_match(phrase, nouns, verbs):
    res = []
    # Collect every (noun, verb) pair of adjacent tagged tokens.
    for i in range(len(phrase) - 1):
        if phrase[i][1] in nouns and phrase[i + 1][1] in verbs:
            res.append((phrase[i], phrase[i + 1]))
    return res

def hypernym_paths(word, pos):
    res = [x.hypernym_paths() for x in wn.synsets(word, pos)]
    return set(flatten(res))

def bool_syn(double, noun_syn, verb_syn):
    """
    Returns boolean if noun/verb double contains the target Wordnet Synsets.
    Arguments:
    double: ((noun, tag), (verb, tag))
    noun_syn, verb_syn: Wordnet Synset string (i.e., 'travel.v.01')
    """
    noun = double[0][0]
    verb = double[1][0]
    noun_bool = wn.synset(noun_syn) in hypernym_paths(noun, 'n')
    verb_bool = wn.synset(verb_syn) in hypernym_paths(verb, 'v')
    return noun_bool & verb_bool

def bool_loop(l, f):
    """
    Tests all list elements for truthiness and
    returns True if any is True.
    Arguments:
    l: List.
    f: Function returning boolean.
    """
    return any(f(e) for e in l)

def bool_noun_verb(series, nouns, verbs, noun_synset_target, verb_synset_target):
    tagged = series.map(tag)
    nvm = lambda x: noun_verb_match(x, nouns, verbs)
    matches = tagged.apply(nvm)
    bs = lambda x: bool_syn(x, noun_synset_target, verb_synset_target)
    return matches.apply(lambda x: bool_loop(x, bs))

phrases = ['Box fell from shelf',
           'Bulb shattered on the ground',
           'A piece of plaster fell from the ceiling',
           'The blame fell on Sarah',
           'Berlin fell on May',
           'The temperature fell abruptly',
           'It fell on the floor']

nouns = "NN NNP PRP NNS".split()
verbs = "VB VBD VBZ".split()
noun_synset_target = 'artifact.n.01'
verb_synset_target = 'travel.v.01'

df = DataFrame()
df['text'] = Series(phrases)
df['fall'] = bool_noun_verb(df.text, nouns, verbs, noun_synset_target, verb_synset_target)
df

7 votes

Here is how I would do it:

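Use nltk to find nouns followed by one or two verbs. To match your exact specifications, I would use Wordnet: the only nouns (NN, NNP, PRP, NNS) that should be found are the ones related to "physical" or "material", and the only verbs (VB, VBZ, VBD, etc.) that should be found are the ones related to "fall".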
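
A minimal sketch of that Wordnet check, assuming NLTK's `Synset` methods `name()`, `hypernyms()` and `instance_hypernyms()`. The helper names `has_ancestor` and `word_matches`, and the target synsets `physical_entity.n.01` and `fall.v.01` in the usage comment, are illustrative assumptions rather than anything prescribed here; choose targets that suit your data.

```python
def has_ancestor(synset, target_name):
    """Return True if `synset` or any of its hypernyms is named `target_name`.

    Works with any object exposing name(), hypernyms() and
    instance_hypernyms(), which NLTK's Synset does.
    """
    stack, seen = [synset], set()
    while stack:
        current = stack.pop()
        if current.name() == target_name:
            return True
        if current.name() in seen:
            continue
        seen.add(current.name())
        # Proper nouns are linked through instance hypernyms, so follow both.
        stack.extend(current.hypernyms() + current.instance_hypernyms())
    return False


def word_matches(word, pos, target_name, lookup):
    """True if any sense of `word` with part of speech `pos` falls under `target_name`.

    `lookup` is a callable such as nltk.corpus.wordnet.synsets.
    """
    return any(has_ancestor(s, target_name) for s in lookup(word, pos))


# Usage with NLTK (needs the WordNet corpus; the targets are assumptions):
#   from nltk.corpus import wordnet as wn
#   word_matches("box", "n", "physical_entity.n.01", wn.synsets)
#   word_matches("fell", "v", "fall.v.01", wn.synsets)
```

Passing the lookup function in keeps the traversal independent of how the synsets are obtained, which also makes it easy to try on hand-built data.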

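I mentioned "one or two verbs" because a verb can be preceded by an auxiliary. You could also build a dependency tree to find subject-verb relations, but that does not seem necessary in this case.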
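
The "one or two verbs" pattern could be sketched roughly as follows, working on already-tagged tokens such as those returned by `nltk.pos_tag`. The function name `noun_verb_pairs` and the exact tag sets (including `MD` for modal auxiliaries) are assumptions made for illustration.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "PRP"}
VERB_TAGS = {"VB", "VBD", "VBZ", "VBP", "VBN", "VBG", "MD"}


def noun_verb_pairs(tagged):
    """Yield (noun, main_verb) for each noun directly followed by one or two verbs.

    `tagged` is a list of (token, treebank_tag) pairs. When two verbs
    follow the noun, the first is treated as an auxiliary and the second
    as the main verb.
    """
    for i, (word, tag) in enumerate(tagged[:-1]):
        if tag not in NOUN_TAGS:
            continue
        verbs = []
        # Look at no more than the two tokens after the noun.
        for next_word, next_tag in tagged[i + 1:i + 3]:
            if next_tag not in VERB_TAGS:
                break
            verbs.append(next_word)
        if verbs:
            yield word, verbs[-1]
```

For a hand-tagged input like `[("The", "DT"), ("box", "NN"), ("has", "VBZ"), ("fallen", "VBN")]` this yields the pair `("box", "fallen")`, skipping the auxiliary.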

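You might also want to make sure you exclude location names and keep person names (because you would accept "John has fallen" but not "Berlin has fallen"). This can also be done with Wordnet: locations have the tag 'noun.location'.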
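
One way to read that tag is through the synset's lexicographer file name, which NLTK exposes as `Synset.lexname()`. The sketch below only looks at the first noun sense; that is a heuristic assumed here (a name can have several noun senses), and the helper names are hypothetical.

```python
def first_sense_lexname(word, lookup):
    """Return the lexname of the first noun sense of `word`, or None if it has none.

    `lookup` is a callable such as nltk.corpus.wordnet.synsets.
    """
    senses = lookup(word, "n")
    return senses[0].lexname() if senses else None


def is_location(word, lookup):
    """Heuristic: treat `word` as a place if its first noun sense is a location."""
    return first_sense_lexname(word, lookup) == "noun.location"


# Usage with NLTK (needs the WordNet corpus):
#   from nltk.corpus import wordnet as wn
#   is_location("Berlin", wn.synsets)
```

Checking every sense instead of the first one would exclude more words, at the cost of also dropping names that are only occasionally used for places.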

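I am not sure in which context you would have to convert the tags, so I cannot give a precise answer on that. It seems to me that you may not need to in this case: you use the POS tags to identify nouns and verbs, and then you check whether each noun and verb belongs to a synset.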
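
If a conversion does turn out to be needed, for example to pass a part of speech to a synset lookup, a common convention is to map on the tag prefix. This is a sketch of that convention, not an official NLTK helper; the function name is hypothetical. The letters returned are the same values as NLTK's `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` constants.

```python
# Two-letter Penn Treebank prefixes mapped to WordNet part-of-speech letters.
_PREFIX_TO_WORDNET = {"JJ": "a", "VB": "v", "NN": "n", "RB": "r"}


def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match.

    Tags outside the four open classes (pronouns, prepositions, particles,
    determiners and so on) have no WordNet counterpart and return None.
    """
    return _PREFIX_TO_WORDNET.get(tag[:2])
```

Matching on two letters rather than one keeps the particle tag `RP` from being mistaken for an adverb.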

Hope this helps.
