How to print the topics from an LDA model in gensim? Python
Using gensim, I was able to extract topics from a set of documents with LSA, but I want to know how to access the topics generated by the LDA model.
When I print lda.print_topics(10), I get an error because print_topics() returns a NoneType, i.e. nothing:
Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable
Here is the code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# I can print out the topics for LSA
# (LSA is trained on the tf-idf transformed corpus, so build that first)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus]
for l, t in izip(corpus_lsi, corpus):
    print l, "#", t
print
for top in lsi.print_topics(2):
    print top
# I can print out the documents and which topic is the most probable for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]
for l, t in izip(corpus_lda, corpus):
    print l, "#", t
print

# But I am unable to print out the topics; how should I do it?
for top in lda.print_topics(10):
    print top
10 Answers
9 votes
I think it is more helpful to see the topics as a list of words. The following code snippet does exactly that; I assume you already have an LDA model called lda_model.
for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))
In the code above, I decided to show the first 30 words of each topic. For simplicity, the output below shows the first two topics I got.
Topic: 0
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
I don't really like the way the topics above are displayed, so I usually modify my code to the following:
for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))
... and then the output (showing the first two topics) looks like this:
Topic: 0
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head
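If you also want each word's weight and not just the word, the same loop can carry the probability along. A minimal variation (a sketch against the same assumed lda_model; with formatted=False, current gensim yields (word, probability) pairs per topic):

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    # join each word with its rounded probability, word*weight style
    print('Topic: {} \nWords: {}'.format(idx, '|'.join('{}*{:.4f}'.format(w, p) for w, p in topic)))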
12 votes
I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

For num_topics number of topics, it returns the num_words most significant words (10 words per topic, by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if it is False.

If log is True, the result is also written to the log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.
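A quick way to see both return forms (a sketch reusing the lda model from the question; note that in recent gensim versions each entry comes back paired with its topic id, and the unformatted word/weight tuples are ordered (word, probability)):

# formatted=True: human-readable topic strings
for entry in lda.show_topics(num_topics=2, num_words=5, formatted=True):
    print(entry)

# formatted=False: raw word/weight pairs, handy for further processing
for entry in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(entry)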
21 votes
After some messing around, it seems that print_topics(numoftopics) in ldamodel has some bug. My workaround is to use print_topic(topicid) instead:
>>> print lda.print_topics()
None
>>> for i in range(0, lda.num_topics-1):
>>> print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...
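For what it's worth, with recent gensim releases this workaround should no longer be needed: print_topics() now returns a list of (topic_id, topic_string) pairs rather than None, so the loop from the question works directly (a sketch assuming a current gensim):

for top in lda.print_topics(10):
    print(top)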