How do I get PMI scores for trigrams with NLTK collocations? (Python)

Posted 2024-05-13 01:28:28

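I know how to get bigram and trigram collocations with NLTK, and I have applied them to my own corpus. The code is below.

My only problem is how to print the bigrams (or trigrams) together with their PMI values. I have searched the NLTK documentation several times. Either I am missing something or it does not exist.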

import nltk
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Read the whole corpus and split it on whitespace.
with open("large.txt", "r") as f:
    myList = f.read().split()
myCorpus = nltk.Text(myList)
trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)

# Ignore trigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)
print(finder.nbest(trigram_measures.pmi, 500000))

2 Answers

I think you are looking for score_ngram. In any case, you do not need a dedicated print function. Just assemble the output yourself:

trigrams = finder.nbest(trigram_measures.pmi, 500000)
# Pair each trigram with its PMI score.
print([(x, finder.score_ngram(trigram_measures.pmi, x[0], x[1], x[2])) for x in trigrams])

If you look at the source code of nltk.collocations.TrigramCollocationFinder (see http://www.nltk.org/_modules/nltk/collocations.html), you will see that nbest() just takes the first n entries returned by score_ngrams() and drops the scores:

def nbest(self, score_fn, n):
    """Returns the top n ngrams when scored by the given function."""
    return [p for p,s in self.score_ngrams(score_fn)[:n]]

So you can call score_ngrams() directly instead of going through nbest(), since it returns a list sorted by score anyway:

def score_ngrams(self, score_fn):
    """Returns a sequence of (ngram, score) pairs ordered from highest to
    lowest score, as determined by the scoring function provided.
    """
    return sorted(self._score_ngrams(score_fn),
                  key=_itemgetter(1), reverse=True)

Try:

from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import word_tokenize  # requires NLTK's tokenizer data to be downloaded

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))

# score_ngrams() returns (trigram, score) pairs, highest PMI first.
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)

[out]:

(('this', 'is', 'a'), 9.047123912114026)
(('is', 'a', 'foo'), 7.46216141139287)
(('black', 'sheep', 'shep'), 5.46216141139287)
(('black', 'sheep', 'foo'), 4.877198910671714)
(('a', 'foo', 'bar'), 4.462161411392869)
(('sheep', 'shep', 'bar'), 4.462161411392869)
(('bar', 'black', 'sheep'), 4.047123912114026)
(('bar', 'black', 'sentence'), 4.047123912114026)
(('sheep', 'foo', 'bar'), 3.877198910671714)
(('bar', 'bar', 'black'), 3.047123912114026)
(('foo', 'bar', 'bar'), 3.047123912114026)
(('shep', 'bar', 'bar'), 3.047123912114026)
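
To see where these numbers come from, the scores can be reproduced by hand. The sketch below assumes the trigram PMI is log2(c(w1 w2 w3) * N^2 / (c(w1) * c(w2) * c(w3))), where c() is a raw count and N is the number of tokens. That is my reading of how the values above arise, not a quote from the NLTK source. The helper name trigram_pmi is made up for illustration, and the text is split on whitespace, which yields the same 23 tokens for this particular sentence.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep  foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # plain whitespace split, no NLTK tokenizer needed here

unigram_counts = Counter(tokens)
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)  # N, the number of tokens


def trigram_pmi(trigram):
    """Hypothetical helper: log2(c(w1 w2 w3) * N^2 / (c(w1) * c(w2) * c(w3)))."""
    w1, w2, w3 = trigram
    joint = trigram_counts[trigram] * total ** 2
    independent = unigram_counts[w1] * unigram_counts[w2] * unigram_counts[w3]
    return math.log2(joint) - math.log2(independent)


print(trigram_pmi(("this", "is", "a")))       # about 9.047
print(trigram_pmi(("bar", "bar", "black")))   # about 3.047
```

A trigram made of words that each occur only once, such as ('this', 'is', 'a'), gets the highest score, which is why a frequency filter like apply_freq_filter(3) is usually applied before ranking by PMI.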
