Python code for finding frequent word pairs with NLTK

raw = open("proj.txt", "r").read()
tokens = nltk.word_tokenize(raw)
pairs = nltk.bigrams(tokens)
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(pairs)
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)

2 answers

User

#1 · edited 2024-05-12 15:28:26

It sounds like you just want a list of word pairs. If so, I think you meant to use finder.score_ngrams, like this:

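import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)  # tokens: the word list from nltk.word_tokenize
scores = finder.score_ngrams(bigram_measures.raw_freq)  # list of (bigram, score) tuples
print(scores)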

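Other scoring measures are available too. It sounds like you only need frequency, but the other general n-gram scoring metrics are listed here: http://nltk.googlecode.com/svn-/trunk/doc/api/nltk.metrics.association.NgramAssocMeasures-class.html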
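To make the difference between scoring measures concrete, here is a minimal sketch that ranks the same bigrams by raw_freq and by pmi. The token list is made-up sample data standing in for the tokenized file, and the variable names by_freq, by_pmi and top_counts are only illustrative.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token list (hypothetical data, standing in for the tokenized file).
tokens = "the cat sat on the mat and the cat ran to the mat".split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# score_ngrams returns (bigram, score) tuples, highest score first.
by_freq = finder.score_ngrams(bigram_measures.raw_freq)
by_pmi = finder.score_ngrams(bigram_measures.pmi)

# ngram_fd holds the plain integer counts, if those are preferred over ratios.
top_counts = finder.ngram_fd.most_common(2)

print(by_freq[:3])
print(by_pmi[:3])
print(top_counts)
```

raw_freq favors pairs that simply occur often, while pmi favors pairs whose words rarely appear apart, so the two rankings can differ noticeably on real text.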

User

#2 · edited 2024-05-12 15:28:26

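It looks like you are calling BigramCollocationFinder without importing it. The correct path is nltk.collocations.BigramCollocationFinder. So you can try this (make sure your text file actually has text in it!):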

>>> import nltk
>>> raw = open('test2.txt').read()
>>> tokens = nltk.word_tokenize(raw)
# or, to exclude punctuation, use something like the following instead of the above line:
# >>> tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(raw)
>>> pairs = nltk.bigrams(tokens)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> trigram_measures = nltk.collocations.TrigramAssocMeasures()
>>> finder = nltk.collocations.BigramCollocationFinder.from_words(pairs)  # note the fully qualified name! (passing pairs rather than tokens makes each result a pair of bigrams)
>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)  # from the Old English text of Beowulf
[(('m\xe6g', 'Higelaces'), ('Higelaces', ',')), (('bearn', 'Ecg\xfeeowes'), ('Ecg\xfeeowes', ':')), (("''", 'Beowulf'), ('Beowulf', 'ma\xfeelode')), (('helm', 'Scyldinga'), ('Scyldinga', ':')), (('ne', 'cu\xfeon'), ('cu\xfeon', ',')), ((',', '\xe6r'), ('\xe6r', 'he')), ((',', 'helm'), ('helm', 'Scyldinga')), ((',', 'bearn'), ('bearn', 'Ecg\xfeeowes')), (('Ne', 'w\xe6s'), ('w\xe6s', '\xfe\xe6t')), (('Beowulf', 'ma\xfeelode'), ('ma\xfeelode', ','))]
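Note that each item in the output above is a pair of bigrams rather than a pair of words, because from_words was given the already-built bigrams. If plain word pairs are what you want, one option is to pass the tokens directly. The following is a rough sketch under that assumption; the sample text is invented for illustration, and RegexpTokenizer is used so that no extra NLTK data download is needed.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample text; in practice this would be open('test2.txt').read().
raw = "New York is big. New York is busy. I love New York. " * 2

# Tokenize on word characters only, which drops punctuation.
tokens = RegexpTokenizer(r"\w+").tokenize(raw)

# Pass the tokens themselves; from_words forms the bigrams internally.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # keep only bigrams seen at least 3 times

# Each result is now a (word, word) tuple.
top_pairs = finder.nbest(BigramAssocMeasures().pmi, 10)
print(top_pairs)
```

With this change, nbest returns at most ten word pairs that passed the frequency filter, ranked by PMI.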
