How do I pass a series/list of tokens to a Gensim Dictionary?

Posted 2024-04-28 02:43:53


I have a pandas DataFrame with a column that contains conversation data. I preprocessed it as follows:

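from gensim.utils import simple_preprocess

def preprocessing(text):
    # Tokenize, strip accents, and drop stop words (stop_words is defined elsewhere)
    return [word for word in simple_preprocess(str(text), min_len=2, deacc=True) if word not in stop_words]

dataset['preprocessed'] = dataset.apply(lambda row: preprocessing(row['msgText']), axis=1)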

To make it one-dimensional, I tried both of the following:

processed_docs = data['preprocessed']

as well as:

processed_docs = data['preprocessed'].tolist()

It now looks like this:

>>> processed_docs[:2]
0    ['klinkt', 'alsof', 'zwaar', 'dingen', 'spelen...
1    ['waar', 'liefst', 'meedenk', 'betekenen', 'pe...

In both cases, I then used:

dictionary = gensim.corpora.Dictionary(processed_docs)

However, in both cases I got the error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string
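
For context, Gensim's doc2bow raises this TypeError when a document it receives is a single str rather than a sequence of tokens. A minimal sketch for locating such documents, assuming processed_docs is any iterable of documents; the helper name find_string_docs and the sample data are hypothetical and not part of the original post:

```python
def find_string_docs(docs):
    """Return the positions of documents that are plain strings, not token lists."""
    return [i for i, doc in enumerate(docs) if isinstance(doc, str)]


# Hypothetical sample: the first entry is a real token list, the second is a
# string that merely looks like a list (for example after a CSV round trip).
sample_docs = [
    ["klinkt", "alsof", "zwaar"],
    "['waar', 'liefst', 'meedenk']",
]

bad_positions = find_string_docs(sample_docs)
print(bad_positions)  # [1]
```

If the check returns any positions, those entries are what trigger the error, regardless of whether the outer container is a Series or a list.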

How can I modify my data so that I no longer get this TypeError?

Since similar questions have been asked before, I looked at:

Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Based on the first answer, I tried the following solution:

dictionary = gensim.corpora.Dictionary([processed_docs.split()])

which gave the error(s):

AttributeError: 'Series'('List') object has no attribute 'split'
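
The attempt above fails because split is a method of str; neither a pandas Series nor a Python list provides it. Splitting only makes sense per document, and only when each document is still one raw string. A minimal sketch under that assumption (the sample sentences and the names raw_docs and tokenized_docs are hypothetical):

```python
# Hypothetical raw documents: each one is a single untokenized string.
raw_docs = ["klinkt alsof zwaar", "waar liefst meedenk"]

# A list has no split method, which is what the AttributeError reports.
print(hasattr(raw_docs, "split"))     # False
print(hasattr(raw_docs[0], "split"))  # True

# Split each document separately to get one token list per document.
tokenized_docs = [doc.split() for doc in raw_docs]
print(tokenized_docs)  # [['klinkt', 'alsof', 'zwaar'], ['waar', 'liefst', 'meedenk']]
```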

In the second answer, someone states that the input must be tokens, which I believe already holds in my case.

Additionally, based on (TypeError: doc2bow expects an array of unicode tokens on input, not a single string when using gensim.corpora.Dictionary()), I used the .tolist() approach shown above, which did not work either.


1 Answer

Anonymous user

#1 · Posted 2024-04-28 02:43:53

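I think you need:

dictionary = gensim.corpora.Dictionary([processed_docs[:]])

The [:] slice runs over the whole collection. You can write [2:] to start at index 2 and continue to the end, [:7] to start at 0 and stop before index 7, or [2:7]. You can also try [:len(processed_docs)].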

I hope this helps :)
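
Another possibility worth checking: the printed output in the question shows quoted tokens, which is how pandas displays a cell holding a string such as "['klinkt', 'alsof']" rather than a real list. The post does not confirm this, so the following is only a sketch under that assumption. It parses each cell back into a list with ast.literal_eval before building the dictionary; raw_cells and to_token_list are hypothetical names, and the gensim import is guarded so the sketch also runs without gensim installed:

```python
import ast

# Hypothetical stand-in for the column values: each cell is a string that
# looks like a list, which is what the quoted tokens in the printout suggest.
raw_cells = [
    "['klinkt', 'alsof', 'zwaar', 'dingen']",
    "['waar', 'liefst', 'meedenk', 'betekenen']",
]


def to_token_list(cell):
    """Turn a stringified list back into a real list of tokens."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return list(cell)


processed_docs = [to_token_list(cell) for cell in raw_cells]
print(processed_docs[0])  # ['klinkt', 'alsof', 'zwaar', 'dingen']

# Building the dictionary is optional here so the sketch runs without gensim.
try:
    from gensim.corpora import Dictionary
except ImportError:
    Dictionary = None

if Dictionary is not None:
    dictionary = Dictionary(processed_docs)
    print(len(dictionary))  # number of unique tokens
```

With a real DataFrame, the same conversion could be written as data['preprocessed'].apply(ast.literal_eval).tolist(), again assuming every cell is a stringified list.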
