How does the NLTK language model (ngram) compute the probability of a word given its context?
I am using Python and NLTK to build a language model, as follows:
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)
# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006
# But I got another problem like this one...
print lm.prob("b", ["This is a context which generates a word"])
But it doesn't seem to work properly. Here is the result:
>>> print lm.prob("word", "This is a context which generates a word")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
return self._alpha(context) * self._backoff.prob(word, context[1:])
File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
return self._alpha(context) * self._backoff.prob(word, context[1:])
File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 82, in prob
"context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting
Can anyone help me with this? Thanks!
4 Answers
Regarding your second problem: it happens because "b" does not occur in the news category of the Brown corpus, which you can verify with:
>>> 'b' in brown.words(categories='news')
False
whereas
>>> 'word' in brown.words(categories='news')
True
I admit the error message looks pretty cryptic, so you may want to file a bug report with the NLTK authors.
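A small aside (my own sketch, not part of the fix above): testing membership against brown.words(...) walks the whole corpus on every lookup, so if you need to check many candidate words it is worth building the vocabulary as a set once and reusing it.
from nltk.corpus import brown

# Sketch: brown.words(...) is rescanned on each membership test,
# so materialise the vocabulary once.
news_vocab = set(brown.words(categories='news'))
print 'word' in news_vocab   # True
print 'b' in news_vocab      # False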
I know this question is old, but it pops up every time I google NLTK's NgramModel class. NgramModel's prob implementation is a little unintuitive. The asker was confused. As far as I can tell, the answers aren't great either. Since I don't use NgramModel often, I used to get confused too. Not any more.
The source code is here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method:
def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz Backoff.
    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """
    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])
(Note: self[context].prob(word) is equivalent to self._model[context].prob(word).)
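To make that control flow concrete, here is a tiny self-contained sketch of the same backoff idea. This is not NLTK's code: the names are made up, and it uses a constant backoff weight where NgramModel computes a properly normalised alpha.
from collections import Counter

# Toy counts over the example sentence used further down in this answer.
tokens = 'the rain in spain falls mainly in the plains'.split()
counts = {
    3: Counter(zip(tokens, tokens[1:], tokens[2:])),
    2: Counter(zip(tokens, tokens[1:])),
    1: Counter((w,) for w in tokens),
}
BACKOFF_WEIGHT = 0.4  # stand-in for the real, normalised _alpha(context)

def backoff_prob(word, context):
    context = tuple(context)
    n = len(context) + 1
    if n == 1:
        return counts[1][(word,)] / float(sum(counts[1].values()))
    if counts[n][context + (word,)] > 0:
        # the full n-gram was observed: score the word under this context
        return counts[n][context + (word,)] / float(counts[n - 1][context])
    # otherwise discount and back off to the shorter context
    return BACKOFF_WEIGHT * backoff_prob(word, context[1:])

print backoff_prob('spain', ('rain', 'in'))    # seen trigram -> 1.0
print backoff_prob('rain', ('mainly', 'the'))  # unseen trigram -> backs off to P('rain' | 'the')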
OK. Now we at least know what to look for. What does context need to be? Let's look at a snippet from the constructor:
for sent in train:
    for ngram in ingrams(chain(self._lpad, sent, self._rpad), n):
        self._ngrams.add(ngram)
        context = tuple(ngram[:-1])
        token = ngram[-1]
        cfd[context].inc(token)

if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)
All right. The constructor builds a conditional probability distribution (self._model) out of a conditional frequency distribution whose "contexts" are tuples of individual words. This tells us that context cannot be a string, or a list containing a single multi-word string. context has to be an iterable of individual words. In fact, the requirement is a bit stricter than that: those tuples or lists must have length n-1. Think of it this way: if you told it to build a trigram model, you had better give it contexts that fit a trigram.
Let's see this in action with a simpler example:
>>> import nltk
>>> obs = 'the rain in spain falls mainly in the plains'.split()
>>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
>>> lm.prob('rain', 'the') #wrong
0.0
>>> lm.prob('rain', ['the']) #right
0.5
>>> lm.prob('spain', 'rain in') #wrong
0.0
>>> lm.prob('spain', ['rain in']) #wrong
'''long exception'''
>>> lm.prob('spain', ['rain', 'in']) #right
1.0
(Incidentally, trying to do anything at all with MLE as your estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)
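If you want the toy model above to reserve some probability mass for unseen events, one option (just a sketch, mirroring the estimator from the original question rather than anything prescribed by this answer) is to swap in Lidstone smoothing:
import nltk

# Same toy bigram model as above, but with the question's Lidstone estimator
# instead of MLEProbDist, so unseen events are not assigned zero mass outright.
obs = 'the rain in spain falls mainly in the plains'.split()
estimator = lambda fdist, bins: nltk.LidstoneProbDist(fdist, 0.2)
lm = nltk.NgramModel(2, obs, estimator=estimator)
print lm.prob('rain', ['the'])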
As for the original question, my best guess at what the asker wanted is:
print lm.prob("word", "generates a".split())
print lm.prob("b", "generates a".split())
...but there are so many misunderstandings going on here that I can't really tell what he was actually trying to do.
Quick fix:
print lm.prob("word", ["This is a context which generates a word"])
# => 0.00493261081006