How to add extra columns to a DataFrame after n-gram counting

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("F:/textclustering/data/filteredtext1.csv", encoding="iso-8859-1", low_memory=False)
document = df['Data']

# Count bigrams across all documents
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(document)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
matrix_terms = np.array(vectorizer.get_feature_names_out())
matrix_freq = np.asarray(X.sum(axis=0)).ravel()

terms = vectorizer.get_feature_names_out()
freqs = X.sum(axis=0).A1
dictionary = dict(zip(terms, freqs))

# One row per bigram, one column holding its total frequency
df = pd.DataFrame(dictionary, index=[0]).T.reindex()
df.to_csv("F:/textclustering/data/terms2.csv", sep=',', na_rep="none")

1 answer

User

#1 · Posted 2024-04-23 18:23:45

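First, convert the documents into a CSR sparse matrix, then convert that into a COO matrix. The COO format lets you read off the row and column position of every nonzero element.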

from itertools import groupby
from sklearn.feature_extraction.text import CountVectorizer

ls = [['example text is great', 1],
      ['this is great', 2], 
      ['example text is great', 3]]
document = [l[0] for l in ls]
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(document)
X = X.tocoo()
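
To see what the COO conversion exposes, here is a minimal sketch that lists every nonzero cell as a (row, col, value) triple. The variable name entries is introduced only for illustration, and the documents are the same three example strings used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

document = ['example text is great', 'this is great', 'example text is great']
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(document).tocoo()

# Each nonzero cell is one (row, col, value) triple:
# row = document index, col = bigram index, value = occurrences in that document.
# The triples are sorted so the listing does not depend on internal storage order.
entries = sorted((int(r), int(c), int(v)) for r, c, v in zip(X.row, X.col, X.data))
for r, c, v in entries:
    print(r, c, v)
```

The row index is the position of the document in the input list, which is what later lets each bigram be traced back to the documents containing it.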

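You can then group by column, which gives one group per bigram. There is a small trick here: the tuples must first be sorted by column. Then, for each row, you can replace the column index with the bigram it stands for. I create that mapping with the dictionary nameid2vocab.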

(code block not preserved in the source)
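
The grouping step described above can be sketched as follows. This is a reconstruction under assumptions, not the answerer's original code: the names id2vocab, triples, and result are illustrative, id2vocab plays the role of the mapping dictionary mentioned in the text, and the count is assumed to be the sum of the per-document occurrences.

```python
from itertools import groupby
from sklearn.feature_extraction.text import CountVectorizer

ls = [['example text is great', 1],
      ['this is great', 2],
      ['example text is great', 3]]
document = [l[0] for l in ls]

vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(document).tocoo()

# Map each column index back to its bigram (vocabulary_ maps bigram -> index)
id2vocab = {int(idx): term for term, idx in vectorizer.vocabulary_.items()}

# groupby only merges adjacent items, so sort by column (then row) first
triples = sorted(zip(X.row, X.col, X.data), key=lambda t: (t[1], t[0]))

result = []
for col, group in groupby(triples, key=lambda t: t[1]):
    group = list(group)
    rows = [int(r) for r, _, _ in group]       # documents containing the bigram
    total = int(sum(v for _, _, v in group))   # assumed: total occurrences
    result.append([int(col), id2vocab[int(col)], total, rows])

print(result)
```

Sorting before grouping matters because itertools.groupby starts a new group every time the key changes, so unsorted input would split one bigram across several groups.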

The output looks like this:

[[0, 'example text', 2, [0, 2]],
 [1, 'is great', 3, [0, 1, 2]],
 [2, 'text is', 2, [0, 2]],
 [3, 'this is', 1, [1]]]
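
Since the question asks for extra columns in a DataFrame, a list of this shape can be loaded into pandas directly. The sketch below hardcodes the rows shown above; the column names bigram_id, bigram, count, doc_rows, and n_docs are assumptions chosen for illustration.

```python
import pandas as pd

# Rows in the same shape as the output above: [column id, bigram, count, document rows]
rows = [[0, 'example text', 2, [0, 2]],
        [1, 'is great', 3, [0, 1, 2]],
        [2, 'text is', 2, [0, 2]],
        [3, 'this is', 1, [1]]]

df = pd.DataFrame(rows, columns=['bigram_id', 'bigram', 'count', 'doc_rows'])

# Example of one more derived column: number of documents containing the bigram
df['n_docs'] = df['doc_rows'].apply(len)

# to_csv with no path returns the CSV text; pass a file path to write to disk instead
csv_text = df.to_csv(index=False)
print(df)
```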
