Using gensim for LSI in Python

4 votes

3 answers

7623 views

Asked 2025-04-16 19:13

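I am using Python's gensim library for latent semantic indexing (LSI). I followed the tutorial on the website and it works well. Now I want to modify it slightly: I want to rebuild the LSI model every time a document is added.

Here is my code: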

# Python 2, gensim 0.7.8 (camelCase API names)
from gensim import corpora, models

stoplist = set('for a of the and to in'.split())
num_factors = 3
corpus = []

for i in range(len(urls)):
    print "Importing", urls[i]
    doc = getwords(urls[i])
    cleandoc = [word for word in doc.lower().split() if word not in stoplist]
    # build the dictionary from the first document, extend it afterwards
    if i == 0:
        dictionary = corpora.Dictionary([cleandoc])
    else:
        dictionary.addDocuments([cleandoc])
    newVec = dictionary.doc2bow(cleandoc)
    corpus.append(newVec)
    # rebuild tf-idf and LSI over the corpus collected so far
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
    corpus_lsi = lsi[corpus_tfidf]

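getwords is a function I wrote that returns the contents of a website as a string. Again, everything works if I wait until all documents have been processed before running tf-idf and LSI, but that is not what I want. I want to do it on every iteration. Unfortunately, I get this error: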

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "streamlsa.py", line 51, in <module>
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 303, in __init__
    self.addDocuments(corpus)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 365, in addDocuments
    self.printTopics(5) # TODO see if printDebug works and remove one of these..
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 441, in printTopics
    self.printTopic(i, topN = numWords)))
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 433, in printTopic
    return ' + '.join(['%.3f*"%s"' % (1.0 * c[val] / norm, self.id2word[val]) for val in most])
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/corpora/dictionary.py", line 52, in __getitem__
    return self.id2token[tokenid] # will throw for non-existent ids
KeyError: 1248

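The error usually occurs on the second document. I think I understand what it is telling me (the dictionary indices are wrong), but I can't figure out why. I have tried many different things and nothing seems to work. Does anyone know what is going on?

Thanks!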

error handling document processing natural language processing iterative processing tfidf gensim latent semantic indexing semantic analysis

3 Answers

In doc2bow you can set allow_update to True, and the dictionary will then be updated automatically on every call to doc2bow.

http://radimrehurek.com/gensim/corpora/dictionary.html

Answered 2025-04-16 by Python大师

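OK, I found a workaround, although it is not optimal.

If you create a dictionary with corpora.Dictionary and then immediately add documents with dictionary.addDocuments, everything works.

However, if you use the dictionary between those two steps (for example by calling dictionary.doc2bow, or by attaching the dictionary to an LSI model through id2word), the dictionary becomes "frozen" and can no longer be updated. You can still call dictionary.addDocuments; it reports that the update happened and even gives the new dictionary size, for example: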

INFO:dictionary:built Dictionary(6627 unique tokens) from 8 documents (total 24054 corpus positions)

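But as soon as you reference any of the new ids, you get the error. I'm not sure whether this is a bug or intended behavior (for whatever reason), but at the very least, gensim reporting that the documents were successfully added to the dictionary is definitely a bug.

I first tried putting the dictionary calls in separate functions, so that only a local copy of the dictionary would be modified. It still failed. That seems strange to me, and I don't know why.

Next I tried passing a copy of the dictionary made with copy.copy. This works, but obviously uses some extra resources. It does let you keep a working copy of the corpus and the dictionary. The biggest drawback for me, though, is that this workaround does not let me remove words that occur only once using filterTokens, because that would require modifying the dictionary.

My other workaround is to rebuild everything (corpus, dictionary, LSI and tf-idf models) on every iteration. On my small sample dataset this gives slightly better results, but on very large datasets it will run into memory problems. Still, this is what I am doing for now.

If any experienced gensim users have a better (and more memory-friendly) solution that won't cause problems on larger datasets, please let me know!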

Answered 2025-04-16 by Python大师

This was a bug in gensim: the reverse id-to-word mapping was cached, but the cache was not updated after addDocuments() was called.

It was fixed in a commit in 2011; see https://github.com/piskvorky/gensim/commit/b88225cfda8570557d3c72b0820fefb48064a049.
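To make the mechanism concrete, here is a toy model of a stale reverse-mapping cache. ToyDictionary is a hypothetical stand-in written for this illustration, not gensim's actual class, and the refresh_cache flag shows only one plausible way to avoid the problem; it is not necessarily what the linked commit does.

```python
class ToyDictionary:
    """Hypothetical stand-in (not gensim code): token->id map with a cached reverse map."""

    def __init__(self, refresh_cache):
        self.token2id = {}
        self.id2token = {}  # reverse map, built lazily on first lookup
        self.refresh_cache = refresh_cache

    def add_documents(self, documents):
        # Assign the next free id to every token not seen before.
        for doc in documents:
            for token in doc:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)

    def __getitem__(self, token_id):
        stale = len(self.id2token) != len(self.token2id)
        # Buggy variant: build the cache once and never look at it again.
        # Fixed variant: rebuild whenever the two maps differ in size.
        if not self.id2token or (self.refresh_cache and stale):
            self.id2token = {i: t for t, i in self.token2id.items()}
        return self.id2token[token_id]


def lookup_after_update(refresh_cache):
    d = ToyDictionary(refresh_cache)
    d.add_documents([["human", "computer"]])
    d[0]  # first lookup builds and caches the reverse map
    d.add_documents([["survey", "user"]])  # token2id grows to 4 entries
    try:
        return d[2]
    except KeyError as err:
        return "KeyError: %s" % err


buggy_result = lookup_after_update(refresh_cache=False)
fixed_result = lookup_after_update(refresh_cache=True)
```

In the buggy variant the forward map reports four tokens, which matches the misleading "built Dictionary" log message, while the lookup of a new id still fails in the same way as the KeyError in the question.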

Answered 2025-04-16 by Python大师
