Conditional probability of an item following a sequence of words, with NLTK

Posted 2024-04-25 06:49:51


No code required, necessarily. I want to compute the probability that, given a sequence of words, the word at the next index is some given word. I'm currently using nltk/python, and I'd like to know whether there is a simple function for this, or whether I need to hard-code it myself by iterating over the text and counting all occurrences.

Thanks


Tags: nltk, probability, n-grams
1 Answer

Answer #1 · Posted 2024-04-25 06:49:51

You first have to iterate over the whole text and count the n-grams; once you have those counts, you can compute the conditional probability of any token following a given sequence.
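To answer the "is there a simple function" part of the question: NLTK's `ConditionalFreqDist` over `nltk.bigrams` does this counting step for you. A minimal sketch (for the one-preceding-word case; longer contexts and smoothed models live in the `nltk.lm` package):

```python
import nltk

# Count, for each word, which words follow it.
tokens = "the cat sat on the mat".split()
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

# FreqDist.freq gives the relative frequency, i.e. the
# conditional probability of the next word.
print(cfd["the"].freq("cat"))  # 0.5 ("the" is followed by "cat" 1 of 2 times)
print(cfd["the"].freq("dog"))  # 0.0 (never observed)
```

If you want to do the counting by hand instead, the example below shows the same idea for arbitrary n.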

Here is a very simple example:

import re
from collections import defaultdict, Counter

# Tokenize the text in a very naive way.
text = "The Maroon Bells are a pair of peaks in the Elk Mountains of Colorado, United States, close to the town of Aspen. The two peaks are separated by around 500 meters (one-third of a mile). Maroon Peak is the higher of the two, with an altitude of 14,163 feet (4317.0 m), and North Maroon Peak rises to 14,019 feet (4273.0 m), making them both fourteeners. The Maroon Bells are a popular tourist destination for day and overnight visitors, with around 300,000 visitors every season."
tokens = re.findall(r"\w+", text.lower(), re.U)


def get_ngram_mapping(tokens, n):
    # Add markers for the beginning and end of the text.
    tokens = ["[BOS]"] + tokens + ["[EOS]"]

    # Map a preceding sequence of n-1 tokens to a list
    # of following tokens. 'defaultdict' is used to
    # give us an empty list when we access a key that
    # does not exist yet.
    ngram_mapping = defaultdict(list)

    # Iterate through the text using a moving window
    # of length n.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i+n]
        preceding_sequence = tuple(window[:-1])
        following_token = window[-1]

        # Example for n=3: 'it is good' =>
        # ngram_mapping[("it", "is")] = ["good"]
        ngram_mapping[preceding_sequence].append(following_token)

    return ngram_mapping


def compute_ngram_probability(ngram_mapping):
    ngram_probability = {}
    for preceding, following in ngram_mapping.items():
        # Let's count which tokens appear right
        # behind the tokens in the preceding sequence.
        # Example: Counter(['a', 'a', 'b'])
        # => {'a': 2, 'b': 1}
        token_counts = Counter(following)

        # Next we compute the probability that
        # a token 'w' follows our sequence 's'
        # by dividing by the frequency of 's'.
        frequency_s = len(following)

        token_probability = defaultdict(float)
        for token, token_frequency in token_counts.items():
            token_probability[token] = token_frequency / frequency_s

        ngram_probability[preceding] = token_probability

    return ngram_probability

ngrams = get_ngram_mapping(tokens, n=2)
ngram_probability = compute_ngram_probability(ngrams)

print(ngram_probability[("the",)]["elk"])  # = 0.14285714285714285
print(ngram_probability[("the",)]["unknown"]) # = 0.0
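Note that the last line above assigns probability 0.0 to any follower never seen in the training text. A common refinement (not part of the original answer) is add-one / Laplace smoothing, sketched here with a hypothetical `vocab_size` parameter standing in for the size of your full vocabulary:

```python
from collections import Counter

def laplace_probability(following_tokens, vocab_size):
    # Add-one (Laplace) smoothing: every token in the vocabulary
    # gets a pseudo-count of 1, so unseen followers no longer
    # receive probability 0. P(w|s) = (count(s,w) + 1) / (count(s) + V)
    counts = Counter(following_tokens)
    total = len(following_tokens)

    def prob(token):
        return (counts[token] + 1) / (total + vocab_size)

    return prob

# Followers of "the" in some text, with an assumed vocabulary of 10 words.
p = laplace_probability(["elk", "town", "two", "two"], vocab_size=10)
print(p("elk"))      # (1 + 1) / (4 + 10) = 0.14285714285714285
print(p("unknown"))  # (0 + 1) / (4 + 10) = 0.07142857142857142
```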
