Word count matrix of a document corpus from a Pandas DataFrame

2 answers

User

#1 · edited 2024-05-17 00:04:10

For any text corpus that is not small, I strongly recommend scikit-learn's CountVectorizer.

It is as simple as:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus) # list of documents (as strings)

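This does not give you a DataFrame in the structure you want, but you can build one using count_vectorizer's vocabulary_ attribute, which maps each term to its column index in the resulting matrix.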
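One way to do that construction is sketched below. The three-document corpus and the names terms and counts_df are made up for illustration and are not part of the original answer; the sketch also densifies the matrix with toarray(), so it assumes the corpus is small.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration.
corpus = ["the cat sat", "the dog sat down", "a cat and a dog"]

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index. Sorting the terms by that index
# makes the column labels line up with the columns of the count matrix.
vocab = count_vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)

# toarray() converts the sparse matrix to a dense one (small corpora only).
counts_df = pd.DataFrame(word_counts.toarray(), columns=terms)
print(counts_df)
```

Each row of counts_df is one document and each column is one term. Note that the default tokenizer drops single-character tokens such as "a".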

User

#2 · edited 2024-05-17 00:04:10

Using sklearn's CountVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'texts': ["This is one text (the first one)",
                             "This is the second text",
                             "And, finally, a third text"
                            ]})

cv = CountVectorizer()
cv.fit(df['texts'])

results = cv.transform(df['texts'])

print(results.shape) # Sparse matrix, (3, 10)

If the corpus is small enough to fit in memory (2000+ documents is small enough), you can convert the sparse matrix to a pandas DataFrame as follows:

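# Densify the sparse matrix and label the columns with the learned vocabulary
# (get_feature_names_out requires scikit-learn >= 1.0)
df_res = pd.DataFrame(results.toarray(), columns=cv.get_feature_names_out())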

df_res is the result you want:

df_res
   and  finally  first  is  one  second  text  the  third  this
0    0        0      1   1    2       0     1    1      0     1
1    0        0      0   1    0       1     1    1      0     1
2    1        1      0   0    0       0     1    0      1     0

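If you get a MemoryError, you can shrink the vocabulary by using different CountVectorizer parameters:

Set stop_words='english' to ignore English stop words (such as the and and).
Use min_df and max_df, which make CountVectorizer ignore some words based on their document frequency (words that occur too frequently or too rarely, which are probably useless).
Use max_features to keep only the n most frequent words.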
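The effect of these parameters on vocabulary size can be checked with a rough sketch like the one below. It reuses the three example texts from this answer; the threshold values (min_df=2, max_df=0.9, max_features=3) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is one text (the first one)",
         "This is the second text",
         "And, finally, a third text"]

# Baseline: default settings keep every token of two or more characters.
full = CountVectorizer().fit(texts)

# Drop common English stop words using scikit-learn's built-in list.
no_stop = CountVectorizer(stop_words="english").fit(texts)

# Keep terms found in at least 2 documents and in at most 90% of documents.
by_df = CountVectorizer(min_df=2, max_df=0.9).fit(texts)

# Keep only the 3 terms with the highest total count across the corpus.
top3 = CountVectorizer(max_features=3).fit(texts)

for name, cv in [("default", full), ("stop_words", no_stop),
                 ("min_df/max_df", by_df), ("max_features", top3)]:
    print(name, len(cv.vocabulary_), sorted(cv.vocabulary_))
```

An integer min_df or max_df is read as an absolute document count, while a float is read as a proportion of documents. Here "text" appears in all three documents, so max_df=0.9 removes it.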
