Finding unusual phrases using a "bag of usual phrases"

2 Answers

User

#1 · Edited 2024-04-20 11:14:51

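You can use Gensim Phrase (collocation) detection to find the usual phrases in your sentences.

But if you want to detect unusual phrases, you can describe some part-of-speech combination patterns with regular expressions, run part-of-speech tagging on the input sentence, and then extract the unseen words (phrases) that match your patterns.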
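A minimal sketch of the pattern idea, assuming the sentence has already been tagged by some part-of-speech tagger with Penn Treebank style tags. The function name `extract_phrases`, the default pattern (optional adjectives followed by one or more nouns), and the `known` set are illustrative assumptions, not part of the answer:

```python
import re

def extract_phrases(tagged, pattern=r"(?:JJ )*(?:NN[SP]* )+"):
    """Return phrases whose tag sequence matches `pattern`.

    tagged: list of (word, tag) pairs from any POS tagger (assumed input).
    pattern: regex over a string of tags, each tag followed by one space.
    """
    tag_string = ""
    starts = {}  # character offset in tag_string -> token index
    for i, (_, tag) in enumerate(tagged):
        starts[len(tag_string)] = i
        tag_string += tag + " "

    phrases = []
    # The lookbehind forces every match to begin at a tag boundary.
    for match in re.finditer(r"(?<!\S)(?:" + pattern + ")", tag_string):
        first = starts[match.start()]
        n_tokens = match.group().count(" ")  # one space per matched tag
        words = [word for word, _ in tagged[first:first + n_tokens]]
        phrases.append(" ".join(words))
    return phrases

# Hypothetical pre-tagged sentence
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("lazy", "JJ"), ("dogs", "NNS")]

phrases = extract_phrases(tagged)
known = {"quick brown fox"}          # hypothetical set of usual phrases
unusual = [p for p in phrases if p not in known]
print(phrases, unusual)
```

Anything that matches the pattern but is absent from the set of usual phrases is a candidate unusual phrase.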

User

#2 · Edited 2024-04-20 11:14:51

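To do this, you can build a simple "language model". It estimates the probability of a phrase and flags phrases with a low average per-word probability as unusual.

For the word probability estimates, it can use smoothed word counts.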
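Written out, the smoothing and the score used by the model in this answer amount to the following, where c(w) is the count of word w in the training corpus, N is the total number of word occurrences, V is the number of distinct words, and δ is the smoothing parameter (the symbols are my notation for the code's `counter_`, `total_count_`, `vocabulary_size_`, and `delta`):

```latex
P(w) = \frac{c(w) + \delta}{N + V\,\delta},
\qquad
\mathrm{score}(s) = -\frac{1}{|s|} \sum_{w \in s} \log P(w)
```

An unseen word has c(w) = 0, so its probability is δ / (N + Vδ), which is small but never zero.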

The model could look like this:

import re
import numpy as np
from collections import Counter

class LanguageModel:
    """ A simple model to measure 'unusualness' of sentences.
    delta is a smoothing parameter.
    The smaller delta is, the higher the penalty for unseen words
    (an unseen word gets probability delta / (total + vocabulary * delta)).
    """
    def __init__(self, delta=0.01):
        self.delta = delta
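    def preprocess(self, sentence):
        # lowercase, split on whitespace, and keep ASCII letters only
        words = sentence.lower().split()
        cleaned = (re.sub(r"[^A-Za-z]+", '', word) for word in words)
        # drop tokens that became empty (e.g. numbers or bare punctuation)
        return [word for word in cleaned if word]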
    def fit(self, corpus):
        """ Estimate counts from an array of texts """
        self.counter_ = Counter(word 
                                 for sentence in corpus 
                                 for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)
    def perplexity(self, sentence):
        """ Calculate negative mean log probability of a word in a sentence 
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to work with unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual)"""
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most likely word """
        return self.perplexity(self.counter_.most_common(1)[0][0])

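You can train this model and apply it to different sentences.

[The training and scoring snippet was not preserved in this copy.]

This prints: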

8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet

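You can see that the "unusualness" of the first phrase is higher than that of the second, because the second is made up of words from the training data.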
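For illustration, here is a compact, self-contained sketch of the same train-then-score workflow. It is not the snippet that produced the numbers above: the toy corpus, the test phrases, and the function names `tokenize`, `fit`, and `unusualness` are all assumptions, so the scores will differ.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase, split on whitespace, keep letters only, drop empty tokens."""
    cleaned = (re.sub(r"[^a-z]+", "", w) for w in sentence.lower().split())
    return [w for w in cleaned if w]

def fit(corpus):
    """Count word occurrences over an iterable of sentences."""
    return Counter(w for sentence in corpus for w in tokenize(sentence))

def unusualness(sentence, counts, delta=0.01):
    """Negative mean log probability per word, with additive smoothing."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    total = sum(counts.values()) + len(counts) * delta
    log_sum = sum(math.log((counts.get(w, 0) + delta) / total) for w in words)
    return -log_sum / len(words)

# Hypothetical toy corpus of "usual" sentences
train = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog",
]
counts = fit(train)

seen = unusualness("the cat sat on the rug", counts)    # only training words
unseen = unusualness("quantum flux capacitor", counts)  # no training words
print(round(seen, 3), round(unseen, 3))
```

The phrase built only from training words gets the lower score, which mirrors the behaviour described above.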

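If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model I used to an N-gram model (for English, a sensible N is 2 or 3). Alternatively, you can use a recurrent neural network to predict the probability of each word conditional on all the previous words. But this requires a really huge training corpus.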
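A rough sketch of what the 2-gram variant might look like, assuming the same additive smoothing applied to the conditional probability of a word given the previous word. The class name `BigramModel`, the `<s>` start marker, and the toy corpus are illustrative assumptions:

```python
import math
from collections import Counter

class BigramModel:
    """Scores a sentence by the negative mean log of P(word | previous word)."""

    def __init__(self, delta=0.01):
        self.delta = delta

    def fit(self, corpus):
        self.context_counts_ = Counter()  # how often a word precedes another
        self.bigram_counts_ = Counter()   # how often each (prev, word) pair occurs
        vocabulary = set()
        for sentence in corpus:
            words = ["<s>"] + sentence.lower().split()
            vocabulary.update(words)
            self.context_counts_.update(words[:-1])
            self.bigram_counts_.update(zip(words, words[1:]))
        # "<s>" is counted in the vocabulary too; acceptable for a sketch
        self.vocabulary_size_ = len(vocabulary)
        return self

    def score(self, sentence):
        words = ["<s>"] + sentence.lower().split()
        pairs = list(zip(words, words[1:]))
        if not pairs:
            return 0.0
        log_sum = 0.0
        for prev, word in pairs:
            numerator = self.bigram_counts_[(prev, word)] + self.delta
            denominator = (self.context_counts_[prev]
                           + self.vocabulary_size_ * self.delta)
            log_sum += math.log(numerator / denominator)
        return -log_sum / len(pairs)

# Hypothetical toy corpus
model = BigramModel().fit(["the cat sat on the mat", "the dog sat on the rug"])
ordered = model.score("the cat sat on the mat")
shuffled = model.score("mat the on sat cat the")
print(round(ordered, 3), round(shuffled, 3))
```

Unlike the 1-gram model, this one is sensitive to word order: the shuffled sentence uses only known words but unseen word pairs, so it scores as more unusual.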

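If you work with a highly inflected language, such as Turkish, you can use character-level N-grams instead of a word-level model, or simply preprocess your texts with a lemmatization algorithm from NLTK.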
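A minimal sketch of the character-level option, assuming character trigrams take the place of words in the same smoothed-count scheme. The helper names and the tiny Turkish word list are illustrative assumptions:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, padded with one space per side."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def fit(corpus, n=3):
    """Count character n-grams over an iterable of texts."""
    return Counter(g for text in corpus for g in char_ngrams(text, n))

def unusualness(text, counts, n=3, delta=0.01):
    """Negative mean log probability per character n-gram, additive smoothing."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    total = sum(counts.values()) + len(counts) * delta
    log_sum = sum(math.log((counts.get(g, 0) + delta) / total) for g in grams)
    return -log_sum / len(grams)

# Hypothetical toy training words (Turkish: forms of "ev", plus "kitaplar")
train = ["evler", "evlerim", "evimiz", "kitaplar"]
counts = fit(train)

inflected = unusualness("evlerimiz", counts)  # unseen form, familiar pieces
unrelated = unusualness("gökyüzü", counts)    # shares no trigram with training
print(round(inflected, 3), round(unrelated, 3))
```

A word-level model would treat "evlerimiz" as completely unseen, while the character-level score stays low because every trigram in it already occurs in the training words.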
