How to generate bi/tri-grams with spacy/n

User

Reply #1 · edited 2024-05-16 10:49:52

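With NLTK you can do this in a few steps:

PoS-tag the sequence.
Generate the n-grams you need (your example has no trigrams, but it does have skip-grams, which you can get by generating trigrams and punching out the middle token).
Discard every n-gram that does not match the pattern JJ NN.

Example: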

import nltk  # the tokenizer and tagger data must be installed, see nltk.download()

def jjnn_pairs(phrase):
    '''
    Iterate over pairs of JJ-NN.
    '''
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    for ngram in ngramise(tagged):
        tokens, tags = zip(*ngram)
        if tags == ('JJ', 'NN'):
            yield tokens

def ngramise(sequence):
    '''
    Iterate over bigrams and 1,2-skip-grams.
    '''
    for bigram in nltk.ngrams(sequence, 2):
        yield bigram
    for trigram in nltk.ngrams(sequence, 3):
        yield trigram[0], trigram[2]

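Extend the pattern ('JJ', 'NN') and the set of n-grams as needed.

I don't think parsing is necessary. The main problem with this approach, however, is that most PoS taggers probably won't tag everything the way you want. For example, the default PoS tagger in my NLTK installation tags "chili" as NN rather than JJ, and "fried" comes out as VBD. Parsing won't help you with that, though!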
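As a rough sketch of what "extending the pattern" could look like: the version below takes a set of accepted tag pairs instead of a single hard-coded one. The names `candidate_pairs`, `matching_pairs` and `PATTERNS` are made up for illustration, `ngrams` is a plain-Python stand-in for `nltk.ngrams`, and the input is tagged by hand so the result does not depend on any particular tagger.

```python
from itertools import islice

def ngrams(sequence, n):
    """Yield consecutive n-tuples (plain-Python stand-in for nltk.ngrams)."""
    items = list(sequence)
    return zip(*(islice(items, i, None) for i in range(n)))

def candidate_pairs(tagged):
    """Yield adjacent pairs, then pairs that skip one token in between."""
    yield from ngrams(tagged, 2)
    for first, _, third in ngrams(tagged, 3):
        yield first, third

def matching_pairs(tagged, patterns):
    """Yield token pairs whose tag pair is one of `patterns`."""
    for pair in candidate_pairs(tagged):
        tokens, tags = zip(*pair)
        if tags in patterns:
            yield tokens

# Hand-tagged input (an assumption, not real tagger output).
TAGGED = [("spicy", "JJ"), ("fried", "VBD"), ("chicken", "NN")]
# Hypothetical extended pattern set: adjective+noun and past-tense-verb+noun.
PATTERNS = {("JJ", "NN"), ("VBD", "NN")}

RESULT = list(matching_pairs(TAGGED, PATTERNS))
print(RESULT)
```

Keeping the patterns in a set makes it cheap to add further tag pairs later without touching the matching loop.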

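User

Reply #2 · edited 2024-05-16 10:49:52

I used spaCy 2.0 with the English model. The idea is to parse the input into nouns and "non-nouns", then combine each non-noun with the noun to build the desired output.

Your input: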

s = ["thai iced tea",
"spicy fried chicken",
"sweet chili pork",
"thai chicken curry",]

spaCy solution:

import spacy
nlp = spacy.load('en') # 'en' is the spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'

def noun_notnoun(phrase):
    doc = nlp(phrase) # create spacy object
    token_not_noun = []
    notnoun_noun_list = []
    noun = None # stays None if the phrase contains no noun

    for item in doc:
        if item.pos_ != "NOUN": # separate nouns and not nouns
            token_not_noun.append(item.text)
        else:
            noun = item.text # the last noun in the phrase wins

    if noun is None: # nothing to pair the other tokens with
        return notnoun_noun_list

    for notnoun in token_not_noun:
        notnoun_noun_list.append(notnoun + " " + noun)

    return notnoun_noun_list

Call the function:

for phrase in s:
    print(noun_notnoun(phrase))

Result:

['thai tea', 'iced tea']
['spicy chicken', 'fried chicken']
['sweet pork', 'chili pork']
['thai chicken', 'curry chicken']
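
The pairing step itself does not depend on spaCy, so here is a minimal sketch of just that logic on hand-written (text, coarse POS) pairs. The function name `pair_with_noun` and the tags in `EXAMPLE` are assumptions for illustration; a real tagger may label these words differently, which changes the output.

```python
def pair_with_noun(tagged):
    """Pair every non-noun token with the last noun in the phrase."""
    nouns = [text for text, pos in tagged if pos == "NOUN"]
    others = [text for text, pos in tagged if pos != "NOUN"]
    if not nouns:
        return []  # no noun to attach the other tokens to
    return [f"{other} {nouns[-1]}" for other in others]

# Hand-written tags (an assumption, not real spaCy output).
EXAMPLE = [("thai", "ADJ"), ("iced", "VERB"), ("tea", "NOUN")]
print(pair_with_noun(EXAMPLE))
```

Note that when a phrase contains two tokens tagged as nouns, only the last one is used as the head and the earlier noun never appears in the output.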

User

Reply #3 · edited 2024-05-16 10:49:52

Something like this:

>>> from nltk import bigrams
>>> text = """thai iced tea
... spicy fried chicken
... sweet chili pork
... thai chicken curry"""
>>> lines = map(str.split, text.split('\n'))
>>> for line in lines:
...     ", ".join([" ".join(bi) for bi in bigrams(line)])
... 
'thai iced, iced tea'
'spicy fried, fried chicken'
'sweet chili, chili pork'
'thai chicken, chicken curry'
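
If you would rather not depend on NLTK for this, plain list slicing gives the same bigrams and also covers trigrams. This is only a sketch: the helper name `word_ngrams` is made up, and it assumes whitespace tokenization is good enough for your input.

```python
def word_ngrams(line, n):
    """Return the word n-grams of a whitespace-separated line as strings."""
    words = line.split()
    # A line shorter than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

TEXT = """thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry"""

for line in TEXT.splitlines():
    # Print the bigrams and the trigrams of each line side by side.
    print(word_ngrams(line, 2), word_ngrams(line, 3))
```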

Or use colibricore: https://proycon.github.io/colibri-core/doc/#installation ;p
