特定词的NLTK搭配

import nltk from nltk.collocations import * from nltk.tokenize import word_tokenize text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence" trigram_measures = nltk.collocations.TrigramAssocMeasures() finder = TrigramCollocationFinder.from_words(word_tokenize(text)) for i in finder.score_ngrams(trigram_measures.pmi): print i

3条回答

网友

1楼 · 编辑于 2024-05-13 09:48:16

至于问题2，是的！NLTK在其关联测度中具有似然比。第一个问题仍然没有答案！

http://nltk.org/api/nltk.metrics.html?highlight=likelihood_ratio#nltk.metrics.association.NgramAssocMeasures.likelihood_ratio

网友

2楼 · 编辑于 2024-05-13 09:48:16

请尝试以下代码：

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Ngrams with 'creature' as a member
creature_filter = lambda *w: 'creature' not in w


## Bigrams
finder = BigramCollocationFinder.from_words(
   nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain 'creature'
finder.apply_ngram_filter(creature_filter)
# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.likelihood_ratio, 10)


## Trigrams
finder = TrigramCollocationFinder.from_words(
   nltk.corpus.genesis.words('english-web.txt'))
# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
# only trigrams that contain 'creature'
finder.apply_ngram_filter(creature_filter)
# return the 10 n-grams with the highest PMI
print finder.nbest(trigram_measures.likelihood_ratio, 10)

它使用似然度量，并过滤掉不包含“生物”一词的ngram

网友

3楼 · 编辑于 2024-05-13 09:48:16

问题1-尝试：

target_word = "electronic" # your choice of word
finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))
for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
print i

我们的想法是过滤掉你不想要的东西。这种方法通常用于过滤ngram特定部分中的单词，您可以根据自己的心意对其进行调整。

相关问题更多 >

编程相关推荐

热门问题

热门文章