How is term frequency calculated in scikit-learn's CountVectorizer?

Posted 2024-06-01 03:17:06

Male | A programmer who enjoys writing Python code.

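I don't understand how CountVectorizer calculates term frequency. I need to know this so that I can make a sensible choice for the max_df parameter when filtering terms out of a corpus. Here is some example code: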

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(min_df=1, max_df=0.9)
    X = vectorizer.fit_transform(['afr bdf dssd', 'afr bdf c', 'afr'])
    # Total number of times each term occurs across the whole corpus
    word_freq_df = pd.DataFrame({
        'term': vectorizer.get_feature_names_out(),
        'occurrences': np.asarray(X.sum(axis=0)).ravel().tolist(),
    })
    # Share of all term occurrences in the corpus
    word_freq_df['frequency'] = word_freq_df['occurrences'] / np.sum(word_freq_df['occurrences'])
    print(word_freq_df.sort_values('occurrences', ascending=False).head())

       occurrences  term  frequency
    0            3   afr   0.500000
    1            2   bdf   0.333333
    2            1  dssd   0.166667

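It appears that "afr" accounts for half of the term occurrences in my corpus, which is what I expect from looking at the corpus. However, when I set max_df = 0.8 in CountVectorizer, the term "afr" is filtered out of my corpus. Playing around with my example, I find that CountVectorizer assigns a frequency of ~0.833 to "afr" for this corpus. Could someone provide a formula for how the term frequency that is compared against max_df is calculated?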

Thanks


