What is the total bigram count returned by NLTK's BigramCollocationFinder?
I'm trying to reproduce some common NLP metrics in my own code, including Manning and Schütze's t-test for collocational significance and the chi-square test.
I ran nltk.bigrams() over the following list of 24 tokens:
tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']
This gave me these 23 bigrams:
[('she', 'knocked'), ('knocked', 'on'), ('on', 'his'), ('his', 'door'), ('door', 'she'),
('she', 'knocked'), ('knocked', 'at'), ('at', 'the'), ('the', 'door'), ('door', '100'),
('100', 'women'), ('women', 'knocked'), ('knocked', 'on'), ('on', "Donaldson's"),
("Donaldson's", 'door'), ('door', 'a'), ('a', 'man'), ('man', 'knocked'),
('knocked', 'on'), ('on', 'the'), ('the', 'metal'), ('metal', 'front'), ('front',
'door')]
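That is, the list above is just the output of a call like this (a minimal sketch, with the tokens list as defined above):

import nltk

bigrams = list(nltk.bigrams(tokens))
print(len(bigrams))  # 23, i.e. len(tokens) - 1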
If I want to compute the t-statistic for ('she', 'knocked'), I enter:
import math

# Total bigrams is 23
t = (2/23 - (4/23)*(2/23)) / math.sqrt((2/23)/23)
# t = 1.16826337761
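Spelled out, here is where those numbers come from (my working, following Manning and Schütze: x_bar = C(w1,w2)/N, mu = C(w1)/N * C(w2)/N, s^2 approximated by x_bar):

from collections import Counter
import math

unigram_count = Counter(tokens)                  # C('she') = 2, C('knocked') = 4
bigram_count = Counter(zip(tokens, tokens[1:]))  # C(('she', 'knocked')) = 2

N = 23                                           # my choice: the number of bigrams
x_bar = bigram_count[('she', 'knocked')] / N     # sample mean
mu = (unigram_count['she'] / N) * (unigram_count['knocked'] / N)  # mean under H0
t = (x_bar - mu) / math.sqrt(x_bar / N)          # s**2 approximated by x_bar
print(t)  # 1.1682633776...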
However, when I try:
finder = BigramCollocationFinder.from_words(tokens)
student_t = finder.score_ngrams(bigram_measures.student_t)
# the entry for ('she', 'knocked') is (('she', 'knocked'), 1.178511301977579)
When I instead set my total bigram count to 24 (the length of the original token list), I get the same result as NLTK:
('she', 'knocked'): 1.17851130198
My question is really simple: what should I use as the population size for these hypothesis tests? The length of the token list or the length of the bigram list? Or does the procedure count some terminal unit that doesn't appear in the output of nltk.bigrams()?
1 Answer
First, let's dig out the score_ngram() function from nltk.collocations.BigramCollocationFinder (see https://github.com/nltk/nltk/blob/develop/nltk/collocations.py):
def score_ngram(self, score_fn, w1, w2):
    """Returns the score for a given bigram using the given scoring
    function. Following Church and Hanks (1990), counts are scaled by
    a factor of 1/(window_size - 1).
    """
    n_all = self.word_fd.N()
    n_ii = self.ngram_fd[(w1, w2)] / (self.window_size - 1.0)
    if not n_ii:
        return
    n_ix = self.word_fd[w1]
    n_xi = self.word_fd[w2]
    return score_fn(n_ii, (n_ix, n_xi), n_all)
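Before going further, it helps to probe what those quantities actually are for the corpus in the question (a quick sketch of my own):

from nltk.collocations import BigramCollocationFinder

finder = BigramCollocationFinder.from_words(tokens)  # tokens as in the question
print(finder.word_fd.N())                    # n_all = 24: word tokens, not bigrams
print(finder.ngram_fd[('she', 'knocked')])   # n_ii = 2 (default window_size is 2, so no scaling)
print(finder.word_fd['she'])                 # n_ix = 2
print(finder.word_fd['knocked'])             # n_xi = 4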
Next, let's look at the student_t() function in nltk.metrics.association (see https://github.com/nltk/nltk/blob/develop/nltk/metrics/association.py):
### Indices to marginals arguments:

NGRAM = 0
"""Marginals index for the ngram count"""

UNIGRAMS = -2
"""Marginals index for a tuple of each unigram count"""

TOTAL = -1
"""Marginals index for the number of words in the data"""

@classmethod
def student_t(cls, *marginals):
    """Scores ngrams using Student's t test with independence hypothesis
    for unigrams, as in Manning and Schutze 5.3.1.
    """
    return ((marginals[NGRAM] -
             _product(marginals[UNIGRAMS]) /
             float(marginals[TOTAL] ** (cls._n - 1))) /
            (marginals[NGRAM] + _SMALL) ** .5)
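Putting the two together: for bigrams, score_ngram() calls the scorer as score_fn(n_ii, (n_ix, n_xi), n_all), so marginals[NGRAM] is the bigram count, marginals[UNIGRAMS] is the pair of unigram counts, and marginals[TOTAL] is word_fd.N(), i.e. the number of word tokens in the data.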
where _product() and _SMALL are:
_product = lambda s: reduce(lambda x, y: x * y, s)
_SMALL = 1e-20
Back to your example:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

tokens = ['she', 'knocked', 'on', 'his', 'door', 'she', 'knocked', 'at',
'the', 'door', '100', 'women', 'knocked', 'on', "Donaldson's", 'door', 'a',
'man', 'knocked', 'on', 'the', 'metal', 'front', 'door']

finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()
print(finder.word_fd.N())
student_t = {k: v for k, v in finder.score_ngrams(bigram_measures.student_t)}
print(student_t['she', 'knocked'])
[out]:
24
1.1785113019775793
NLTK takes the number of tokens as the total, i.e. 24. I don't think that's how the student_t score is conventionally computed, though; I would go with #Ngrams rather than #Tokens (see nlp.stanford.edu/fsnlp/promo/colloc.pdf and www.cse.unt.edu/~rada/CSCE5290/Lectures/Collocations.ppt). That said, the total is a constant, and since #Tokens = #Ngrams + 1 for bigrams, I'm not sure the size of the difference matters much once #Tokens gets large; a quick numeric comparison follows below.
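To see how much that choice matters for this example, here is a quick sketch of my own using the count-based form of the score (smoothing term omitted):

import math

def t_score(n_ii, n_ix, n_xi, n_all):
    # count-based t-score, as NLTK computes it
    return (n_ii - n_ix * n_xi / n_all) / math.sqrt(n_ii)

print(t_score(2, 2, 4, 24))  # 1.1785113..., with #Tokens, what NLTK uses
print(t_score(2, 2, 4, 23))  # 1.1682633..., with #Ngrams, your original value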
Now let's go deeper into how NLTK computes student_t. If we lift the student_t() function out of its class and plug in the arguments directly, we get the same output:
from functools import reduce

NGRAM = 0
"""Marginals index for the ngram count"""
UNIGRAMS = -2
"""Marginals index for a tuple of each unigram count"""
TOTAL = -1
"""Marginals index for the number of words in the data"""

_product = lambda s: reduce(lambda x, y: x * y, s)
_SMALL = 1e-20

def student_t(*marginals):
    """Scores ngrams using Student's t test with independence hypothesis
    for unigrams, as in Manning and Schutze 5.3.1.
    """
    _n = 2  # bigrams
    return ((marginals[NGRAM] -
             _product(marginals[UNIGRAMS]) /
             float(marginals[TOTAL] ** (_n - 1))) /
            (marginals[NGRAM] + _SMALL) ** .5)

ngram_freq = 2
w1_freq = 2
w2_freq = 4
total_num_words = 24

print(student_t(ngram_freq, (w1_freq, w2_freq), total_num_words))
# 1.1785113019775793
So we see that in NLTK, the student_t score for this bigram is computed as:
import math
(2 - 2*4/float(24)) / math.sqrt(2 + 1e-20)
Or, written as a formula:
(ngram_freq - (w1_freq * w2_freq) / total_num_words) / sqrt(ngram_freq + 1e-20)
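One last observation, easy to verify by multiplying numerator and denominator by total_num_words: this count-based form is algebraically identical to the probability-based form in Manning and Schütze, with x_bar = ngram_freq / total_num_words and mu = (w1_freq / total_num_words) * (w2_freq / total_num_words):

(x_bar - mu) / sqrt(x_bar / total_num_words) = (ngram_freq - w1_freq * w2_freq / total_num_words) / sqrt(ngram_freq)

So the gap between your 1.16826337761 and NLTK's 1.17851130198 comes entirely from using 23 vs. 24 as the total, not from the shape of the formula.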