Gensim topic printing error/issue

0 votes
2 answers
982 views
Asked 2025-04-17 18:12

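Hi all,

This is a repost of something I replied with in this thread. When I try to print LSI topics in gensim, I get results that look completely wrong. Here is my code: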

try:
    from gensim import corpora, models
except ImportError as err:
    print(err)

class LSI:
    def topics(self, corpus):
        tfidf = models.TfidfModel(corpus)
        corpus_tfidf = tfidf[corpus]
        dictionary = corpora.Dictionary(corpus)
        lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
        print(lsi.show_topics())

if __name__ == '__main__':
    data = '../data/data.txt'
    corpus = corpora.textcorpus.TextCorpus(data)
    LSI().topics(corpus)

This code prints the following to the console:

-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)" + ......

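I would like to print the topics the way @2er0 did here, but this is what I get instead. In the output above, note that the second item printed in each term is a tuple, and I have no idea where it comes from. data.txt is a text file containing a few paragraphs of text, nothing more.

If anyone has any ideas on this, that would be great! Adam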

2 Answers

0

It is not pretty, but this gets the job done (it is a purely string-based approach):

# x = lsi.show_topics()
x = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'

# Split each 'weight*"(id, count)"' term into (weight, (id, count)) strings.
y = []
for term in x.strip().split(" + "):
    weight, pair = term.split("*")
    token_id, count = pair.strip('"()').split(", ")
    y.append((weight, (token_id, count)))

for i in y:
    print(i)

The code above prints:

('-0.804', ('5', '1'))
('-0.246', ('856', '1'))
('-0.227', ('145', '1'))
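
A somewhat sturdier take on the same string-based idea, offered only as a sketch and not as anything gensim itself provides: a regular expression that pulls each weight and quoted token out of a topic string. The names TERM_RE and parse_topic are made up for this illustration, and the pattern assumes the weight*"token" format shown in the question.

```python
import re

# Hypothetical helper, not part of gensim: matches one weight*"token" term.
TERM_RE = re.compile(r'(-?\d+\.\d+)\*"([^"]*)"')


def parse_topic(topic):
    """Return a list of (weight, token) pairs parsed from a topic string."""
    return [(float(weight), token) for weight, token in TERM_RE.findall(topic)]


topic = '-0.804*"(5, 1)" + -0.246*"(856, 1)" + -0.227*"(145, 1)"'
terms = parse_topic(topic)

for weight, token in terms:
    print(weight, token)
```

Because the token is taken as whatever sits between the double quotes, the same helper should also handle topics whose tokens are ordinary words.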

If that does not work for you, try lsi.print_topic(i) instead of lsi.show_topics().

for i in range(len(lsi.show_topics())):
    print(lsi.print_topic(i))

4

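To explain why your LSI topics are tuples rather than words, start by checking your input corpus.

Was it created from a list of documents via corpus = [dictionary.doc2bow(text) for text in texts]?

Because if it was not, and you are only reading the data from a serialized corpus without also reading the dictionary, you will not see words in the topic output.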
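
To make the point concrete, here is a toy sketch in plain Python. It is not gensim's implementation; build_vocab and doc2bow are simplified stand-ins written for this illustration. It shows how a vocabulary built from bag-of-words vectors, rather than from the tokenized texts, ends up with (id, count) tuples as its "words".

```python
def build_vocab(docs):
    """Toy stand-in for a dictionary: map each distinct token to an integer id."""
    token2id = {}
    for doc in docs:
        for token in doc:
            # str() mimics a dictionary that stores every token as text.
            token2id.setdefault(str(token), len(token2id))
    return token2id


def doc2bow(vocab, doc):
    """Toy bag-of-words: a sorted list of (token_id, count) pairs."""
    counts = {}
    for token in doc:
        idx = vocab[str(token)]
        counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())


texts = [["human", "machine", "interface"], ["machine", "survey"]]

# Correct: the vocabulary is built from the tokenized texts.
word_vocab = build_vocab(texts)
bows = [doc2bow(word_vocab, text) for text in texts]

# Mistake: a second vocabulary built from the bag-of-words vectors themselves.
tuple_vocab = build_vocab(bows)

print(sorted(word_vocab))   # ['human', 'interface', 'machine', 'survey']
print(sorted(tuple_vocab))  # ['(0, 1)', '(1, 1)', '(2, 1)', '(3, 1)']
```

The question's code appears to follow the second pattern: corpora.Dictionary(corpus) is built from the corpus output rather than from the words, so the model's id2word mapping has only tuples to show.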

My code below works and prints the topics as weighted words:

import gensim as gs

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gs.corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = gs.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=5)
# Print each topic as a weighted combination of words.
for i in range(lsi.num_topics):
    print(lsi.print_topic(i))

The code above prints:

-0.331*"system" + -0.329*"a" + -0.329*"survey" + -0.241*"user" + -0.234*"minors" + -0.217*"opinion" + -0.215*"eps" + -0.212*"graph" + -0.205*"response" + -0.205*"time"
-0.330*"minors" + 0.313*"eps" + 0.301*"system" + -0.288*"graph" + -0.274*"a" + -0.274*"survey" + 0.268*"management" + 0.262*"interface" + 0.208*"human" + 0.189*"engineering"
0.282*"trees" + 0.267*"the" + 0.236*"in" + 0.236*"paths" + 0.236*"intersection" + -0.233*"time" + -0.233*"response" + 0.202*"generation" + 0.202*"unordered" + 0.202*"binary"
-0.247*"generation" + -0.247*"unordered" + -0.247*"random" + -0.247*"binary" + 0.219*"minors" + -0.214*"the" + -0.214*"to" + -0.214*"error" + -0.214*"perceived" + -0.214*"relation"
0.333*"machine" + 0.333*"for" + 0.333*"lab" + 0.333*"abc" + 0.333*"applications" + 0.258*"computer" + -0.214*"system" + -0.194*"eps" + -0.191*"and" + -0.188*"testing"
