Appending/merging data frames with LDA output

Posted 2024-03-28 02:16:10


I am building an LDA model using Gensim and spaCy.

In outline:

# Lda is assumed here to be an alias for gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=4, random_state=100, update_every=3, chunksize=50, id2word=dictionary, passes=100, alpha='auto')
ldamodel.print_topics(num_topics=4, num_words=6)

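I now have some output, and I would like to add each document's dominant topic and percentage contribution to the original data frame (the one the text came from).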

The original df looks like this:

id  group text
234 1     here is some text
837 7     here is some text
494 2     here is some text
223 1     here is some text

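I do some standard preprocessing, including lemmatization and stop-word removal, and then compute the percentage contribution for each document.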
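For illustration, the per-document step could look roughly like the sketch below. It assumes each document's topic distribution is a list of (topic_id, probability) pairs, which is the shape Gensim's LdaModel returns for a single bag-of-words document. The sample probabilities and the column name Perc_Contribution are hypothetical; the real column is hidden behind the "..." in the output further down.

```python
# Hypothetical per-document topic distributions, shaped like the
# (topic_id, probability) pairs an LDA model yields for one document.
doc_topics = [
    [(0, 0.10), (1, 0.70), (3, 0.20)],
    [(0, 0.55), (2, 0.45)],
    [(1, 0.60), (3, 0.40)],
]


def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


rows = []
for doc_no, topic_probs in enumerate(doc_topics):
    topic_id, prob = dominant_topic(topic_probs)
    rows.append({
        "Document_No": doc_no,                 # position of the document in the corpus
        "Dominant_Topic": topic_id,
        "Perc_Contribution": round(prob, 4),   # hypothetical column name
    })
```

Because Document_No is just the position in the corpus, it lines up with the original df only if the corpus was built from df row by row, in order, with no rows dropped.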

My output looks like this:

   Document_No  Dominant_Topic  ...                                           Keywords Text
0            0             1.0  ...  RT, new, work, amp, year, today, people, look,...    0
1            1             0.0  ...  like, time, good, know, day, find, research, a...    1
2            2             1.0  ...  RT, new, work, amp, year, today, people, look,...    2
3            3             3.0  ...  study, t, change, use, want, Trump, love, stud...    3
4            4             3.0  ...  study, t, change, use, want, Trump, love, stud...    4

I thought I could join the two dfs like this:

results = pd.concat([df, results])

But when I do that, the indexes don't match and I'm left with a sort of Frankenstein df that looks like this:

id  group text                Document_No  Dominant_Topic  ...                                           
NaN NaN   NaN                 0            1.0             ...
NaN NaN   NaN                 1            0.0             ...
494 2     here is some text   NaN          NaN             ...
223 1     here is some text   NaN          NaN             ...
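The pattern above is what pd.concat produces with its default axis=0: rows are stacked and columns are aligned by name, so every column missing from one frame is filled with NaN. A minimal sketch of the difference follows. It uses made-up toy frames and assumes that results has exactly one row per row of df, in the same order. If documents were dropped or reordered during preprocessing, that assumption breaks, and an explicit key column would be needed instead.

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (values are made up).
df = pd.DataFrame(
    {
        "id": [234, 837, 494, 223],
        "group": [1, 7, 2, 1],
        "text": ["here is some text"] * 4,
    },
    index=[10, 11, 12, 13],  # assumed: index is no longer 0..n-1
)
results = pd.DataFrame(
    {
        "Document_No": [0, 1, 2, 3],
        "Dominant_Topic": [1.0, 0.0, 1.0, 3.0],
    }
)

# Default axis=0 stacks rows; columns absent from one frame become NaN.
stacked = pd.concat([df, results])

# axis=1 places columns side by side, aligned on the index. Resetting both
# indexes makes row i of df line up with row i of results.
combined = pd.concat(
    [df.reset_index(drop=True), results.reset_index(drop=True)],
    axis=1,
)
```

Here stacked has eight rows with NaN-filled gaps, while combined has four rows carrying the columns of both frames.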

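I'm happy to post fuller code if that would help, but I'm hoping someone knows a better way to do this, starting from the point where I print the topics.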

