Understanding NLTK collocation scoring for bigrams and trigrams

27 votes

1 answer

27126 views

数据工程师

Asked 2025-04-17 09:17

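Background:

I'm trying to compare pairs of words to see which pair is "more likely to occur" in American English. My plan is to use the collocation facilities in NLTK to score the word pairs, with the higher-scoring pair being the more likely one.

Approach:

I wrote the following in Python using NLTK (some steps and the imports are omitted for brevity):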

bgm    = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)

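Results:

I then checked the results with two word pairs, one that should be very likely to co-occur and one that should not ("roasted cashews" and "gasoline cashews"). I was surprised to find that the two pairs received exactly the same score: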

[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]

I would have expected "roasted cashews" to score higher than "gasoline cashews".

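Questions:

Am I misunderstanding how collocations are meant to be used?
Is there something wrong with my code?
Is my assumption that the scores should differ wrong, and if so, why?

Thank you very much for any information or help!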

nltk natural language processing statistical analysis bigram collocation trigram word pairing scoring method

1 Answer

The NLTK collocations documentation looks pretty good. You can find it here: http://www.nltk.org/howto/collocations.html
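As for why the two pairs tied: a likelihood-ratio score is computed purely from counts (how often the pair occurs, how often each word occurs, and the total), so it knows nothing about English beyond the tokens it is handed. The sketch below is a plain-Python version of the log-likelihood ratio (G²) over a 2x2 contingency table, not NLTK's own code. The function name `bigram_llr` and the two-token inputs are assumptions made for illustration, since the question does not show how `tokens` was built, and NLTK's implementation may differ in details such as smoothing.

```python
import math
from collections import Counter


def bigram_llr(tokens, pair):
    """Log-likelihood ratio (G^2) for one adjacent word pair.

    Illustrative only: the total is taken to be the number of tokens,
    and no smoothing is applied.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    w1, w2 = pair
    n_ii = bigrams[pair]   # times w1 is immediately followed by w2
    n_ix = unigrams[w1]    # times w1 occurs at all
    n_xi = unigrams[w2]    # times w2 occurs at all

    # Observed 2x2 contingency table and the counts expected under independence.
    observed = [n_ii, n_ix - n_ii, n_xi - n_ii, n - n_ix - n_xi + n_ii]
    expected = [
        n_ix * n_xi / n,
        n_ix * (n - n_xi) / n,
        (n - n_ix) * n_xi / n,
        (n - n_ix) * (n - n_xi) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical tiny inputs: each "corpus" is just the pair itself.
likely = bigram_llr(["roasted", "cashews"], ("roasted", "cashews"))
unlikely = bigram_llr(["gasoline", "cashews"], ("gasoline", "cashews"))
print(likely == unlikely)
```

Both calls see exactly the same counts, so the scores are identical. The only way for "roasted cashews" to pull ahead is for the counts to come from a corpus in which the two pairs actually behave differently.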

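You need to give the scorer some real, reasonably large body of text to work with. Here is an example using NLTK's Brown corpus. It takes about 30 seconds to run.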

import collections

import nltk.collocations
import nltk.corpus

# The Brown corpus must be available locally, e.g. via nltk.download('brown').
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by the first word in the bigram.
prefix_keys = collections.defaultdict(list)
for key, score in scored:
    prefix_keys[key[0]].append((key[1], score))

# Sort each group so the strongest association comes first.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])

The output looks reasonable: it works well for "baseball", but less well for "doctor" and "happy".

doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
  ('annoys', 19.009636692022365),
  ('had', 16.730384189212423), ('retorted', 15.190847940499127)]

baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
  ('park', 23.509042621473505), ('games', 23.105033513054011),
  ("player's", 16.227872863424668)]

happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
  ('family', 13.734352182441569),
  (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
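
To come back to the original goal of comparing two specific pairs, one option is to turn the scored list into a dictionary and look each pair up. This is only a sketch: it assumes `scored` has the shape returned by `finder.score_ngrams`, a list of `((word1, word2), score)` tuples, the helper name `more_likely_pair` is hypothetical, and the two sample entries reuse scores from the output above, rounded. A pair that never occurs in the corpus has no entry at all, so it is treated here as missing rather than as a low score.

```python
def more_likely_pair(scored, pair_a, pair_b):
    """Return whichever pair scored higher, or None if neither was seen."""
    scores = dict(scored)          # {(word1, word2): score}
    score_a = scores.get(pair_a)   # None when the pair never occurred
    score_b = scores.get(pair_b)
    if score_a is None and score_b is None:
        return None
    if score_b is None:
        return pair_a
    if score_a is None:
        return pair_b
    return pair_a if score_a >= score_b else pair_b


# Stand-in for the output of finder.score_ngrams(...), using rounded scores
# from the Brown corpus run.
scored = [(("doctor", "bills"), 35.06), (("baseball", "game"), 32.11)]

print(more_likely_pair(scored, ("baseball", "game"), ("baseball", "bills")))
```

Here `("baseball", "bills")` is absent from the sample data, so the seen pair is returned. Returning `None` when both pairs are missing keeps "never observed" distinct from "observed but weakly associated".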

Answered 2025-04-17 by Python大师
