I am developing an LDA model using Gensim and spaCy. In general:
# Assuming Lda is an alias for gensim's LdaModel
from gensim.models.ldamodel import LdaModel as Lda
ldamodel = Lda(doc_term_matrix, num_topics=4, random_state=100, update_every=3, chunksize=50, id2word=dictionary, passes=100, alpha='auto')
ldamodel.print_topics(num_topics=4, num_words=6)
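I now have some output, and I want to add the dominant topic and its percentage contribution for each document to the original DataFrame (the one the text came from).
The original df looks like this: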
id group text
234 1 here is some text
837 7 here is some text
494 2 here is some text
223 1 here is some text
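I do some standard preprocessing, including lemmatization and stop-word removal, and then compute the percentage contribution for each document.
My output looks like this: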
Document_No Dominant_Topic ... Keywords Text
0 0 1.0 ... RT, new, work, amp, year, today, people, look,... 0
1 1 0.0 ... like, time, good, know, day, find, research, a... 1
2 2 1.0 ... RT, new, work, amp, year, today, people, look,... 2
3 3 3.0 ... study, t, change, use, want, Trump, love, stud... 3
4 4 3.0 ... study, t, change, use, want, Trump, love, stud... 4
I thought I could join the two DataFrames like this:
results = pd.concat([df, results])
But when I do that, the indexes don't match, and I'm left with a sort of Frankenstein df that looks like this:
id group text Document_No Dominant_Topic ...
NaN NaN NaN 0 1.0 ...
NaN NaN NaN 1 0.0 ...
494 2 here is some text NaN NaN ...
223 1 here is some text NaN NaN ...
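The NaN blocks above are probably what pd.concat produces with its default axis=0, which stacks the rows of one frame under the other. A minimal sketch of a side-by-side join follows. It assumes that results has exactly one row per row of df, in the same order. The toy frames and the non-default index on df are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (values are illustrative).
df = pd.DataFrame(
    {
        "id": [234, 837, 494, 223],
        "group": [1, 7, 2, 1],
        "text": ["here is some text"] * 4,
    },
    index=[10, 11, 12, 13],  # a non-default index, e.g. left over from filtering
)
results = pd.DataFrame(
    {
        "Document_No": [0, 1, 2, 3],
        "Dominant_Topic": [1.0, 0.0, 1.0, 3.0],
    }
)

# axis=1 places the frames side by side and aligns them on the index,
# so both indexes are reset to 0..n-1 first to make the rows line up.
combined = pd.concat(
    [df.reset_index(drop=True), results.reset_index(drop=True)],
    axis=1,
)
print(combined)
```

If rows were dropped during preprocessing, the row order assumption no longer holds, and carrying the id column through to results and merging on it would be the safer route.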
I'm happy to post more complete code if that helps, but I'm hoping someone knows a better way to do this, ideally at the point where I print the topics.
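One possible way to avoid the second DataFrame altogether is to write the dominant topic straight into df as new columns. The sketch below assumes the per-document topic distributions are available as lists of (topic_id, probability) pairs, which is, as far as I know, the shape gensim's get_document_topics returns for a bag-of-words document. The helper dominant_topics, the column name Perc_Contribution, and the sample numbers are all hypothetical.

```python
import pandas as pd


def dominant_topics(doc_topics):
    """Return (topic_id, probability) of the highest-weighted topic per document.

    Assumes every document has at least one (topic_id, probability) pair.
    """
    rows = []
    for topics in doc_topics:
        topic_id, prob = max(topics, key=lambda pair: pair[1])
        rows.append((topic_id, round(float(prob), 4)))
    return rows


# Hypothetical distributions, one list per row of df, in the same order.
doc_topics = [
    [(0, 0.10), (1, 0.70), (3, 0.20)],
    [(0, 0.90), (2, 0.10)],
    [(1, 0.55), (2, 0.45)],
    [(3, 0.60), (0, 0.40)],
]

df = pd.DataFrame(
    {
        "id": [234, 837, 494, 223],
        "group": [1, 7, 2, 1],
        "text": ["here is some text"] * 4,
    }
)

dominant = dominant_topics(doc_topics)
# Assigning plain lists is positional, so the index of df does not matter.
df["Dominant_Topic"] = [topic for topic, _ in dominant]
df["Perc_Contribution"] = [prob for _, prob in dominant]
print(df)
```

Because the new columns are assigned by position, this only works if doc_topics was built from the rows of df in their current order.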