NLTK's NgramModel always gives a word the same probability, regardless of context

Asked 2025-04-18 02:58

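I'm using NgramModel from nltk to compute the probability of finding a given word in a sentence. My problem is that every word gets exactly the same probability, no matter what the context is! Here is some sample code that demonstrates the problem.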

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist, WittenBellProbDist
from nltk.model import NgramModel

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

lm = NgramModel(3, brown.words(categories='news'), estimator=estimator)
>>> print lm.prob("word", ["This is a context which generates a word"])
0.00493261081006
>>> print lm.prob("word", ["This is a context of a word"])
0.00493261081006
>>> print lm.prob("word", ["This word"])
0.00493261081006
>>> print lm.prob("word", ["word"])
0.00493261081006
>>> print lm.prob("word", ["adnga"])
0.00493261081006


1 Answer

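First, the context should not contain the word itself, unless the word really is repeated. Second, the Brown corpus is small, so unless you hit a trigram that actually occurs in the data, you will get the same answer every time. In my example I use bigrams, so the model does not fall back to smoothing on every query; in your example the smoothed estimate is used every single time. Third, LidstoneProbDist does not actually work well. It is the simplest way to smooth, but it is not recommended in practice. SimpleGoodTuringProbDist is much better by comparison.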

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist
from nltk.model import NgramModel

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)

lm = NgramModel(2, brown.words(categories='news'), estimator=estimator)

lm.prob("good", ["very"])          # 0.0024521936223426436
lm.prob("good", ["not"])           # 0.0019510849023145812
lm.prob("good", ["unknown_term"])  # 0.017437821314436573
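
To see why an unseen context always yields the same number, here is a minimal pure-Python sketch. It is not NLTK's implementation: `train_bigram`, `lidstone` and `prob` are hypothetical helpers, the toy corpus is made up, and the plain fallback to a smoothed unigram estimate is a simplified stand-in for the backoff that NgramModel performs. It also illustrates that a context passed as one whole string inside a list, as in the question, behaves like a single token that never occurs in the data, whereas a tokenized context can match.

```python
from collections import Counter


def train_bigram(tokens):
    """Count unigrams, bigrams, and how often each token appears as a context."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return unigrams, bigrams, contexts


def lidstone(count, total, bins, gamma=0.2):
    """Lidstone estimate: (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)


def prob(word, context, model, gamma=0.2):
    """P(word | last token of context), with a simplified unigram fallback."""
    unigrams, bigrams, contexts = model
    bins = len(unigrams) + 1  # assumption: one extra bin for unseen words
    prev = context[-1] if context else None
    if prev in contexts:
        # Seen context: the estimate depends on the bigram count.
        return lidstone(bigrams[(prev, word)], contexts[prev], bins, gamma)
    # Unseen context: the estimate ignores the context entirely.
    return lidstone(unigrams[word], sum(unigrams.values()), bins, gamma)


# Toy corpus (made up for illustration only).
tokens = "the cat sat on the mat and the cat ran".split()
model = train_bigram(tokens)

whole_string = prob("cat", ["This is a context of a cat"], model)  # one "token"
nonsense = prob("cat", ["adnga"], model)                           # unseen token
tokenized = prob("cat", "it sat on the".split(), model)            # last token "the"

print(whole_string == nonsense)  # both take the context-free fallback
print(tokenized > whole_string)  # the seen context "the" raises the estimate
```

With the tokenized context the bigram count for ("the", "cat") is used, so the probability differs. With either unseen context the result is identical, which matches the behaviour described in the question.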

Answered 2025-04-18 by Python大师
