Why does the tf-idf model in `gensim` throw away terms and counts after I transform the corpus?

Asked 2025-04-17 16:50

When I use gensim's tf-idf model and transform my corpus with it, why does the model drop some terms and their counts?

My code:

from gensim import models

# A small corpus of 4 documents in bag-of-words form: (term_id, count) pairs.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0, doc1, doc2, doc3]

# Train a tf-idf model on the corpus.
tfidf = models.TfidfModel(corpus)

# Printing the corpus at this point still shows the raw frequency counts.
for d in corpus:
    print(d)
print()

# To convert the corpus to tf-idf, apply the model to it;
# this yields the normalized tf-idf weights.
corpus = tfidf[corpus]

for d in corpus:
    print(d)

Output:

[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]

[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]

text-processing corpus tf-idf term-frequency

1 Answer

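IDF is computed by dividing the total number of documents by the number of documents that contain a given term, and then taking the logarithm of that ratio. In your example every document contains term 0, so its IDF is log(4/4) = log(1) = 0. As a result, the term 0 column of your document-term matrix is all zeros. gensim returns sparse vectors and leaves out zero-weight entries, which is why term 0 disappears from the output and doc1, which contains only term 0, comes back empty.

A term that appears in every document gets a weight of zero; it carries no information for telling the documents apart.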
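To make the arithmetic concrete, here is a minimal hand-rolled sketch of the same calculation in plain Python, without gensim. It assumes a base-2 logarithm, unit-length (L2) normalization, and a small cutoff `eps` for dropping near-zero weights; I believe these match gensim's defaults, but treat them as assumptions. The helper names `document_frequencies` and `tfidf_by_hand` are made up for illustration and are not part of gensim.

```python
import math

# Same bag-of-words corpus as in the question: (term_id, count) pairs.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def document_frequencies(corpus):
    """Count how many documents each term id appears in."""
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_by_hand(corpus, eps=1e-12):
    """Hypothetical helper: tf * log2(N / df), then L2-normalize each document.

    Entries whose weight is (numerically) zero are dropped, mimicking a
    sparse representation. The base-2 log and eps cutoff are assumptions.
    """
    n_docs = len(corpus)
    df = document_frequencies(corpus)
    idf = {t: math.log2(n_docs / d) for t, d in df.items()}

    transformed = []
    for doc in corpus:
        weighted = [(t, tf * idf[t]) for t, tf in doc]
        # Drop zero-weight terms: this is where term 0 vanishes.
        weighted = [(t, w) for t, w in weighted if abs(w) > eps]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        transformed.append([(t, w / norm) for t, w in weighted] if norm else [])
    return idf, transformed


idf, transformed = tfidf_by_hand(corpus)
print(idf)  # term 0 has idf 0.0; term 1 has idf log2(4/3), roughly 0.415
for doc in transformed:
    print(doc)
```

Term 0 occurs in all 4 documents, so its IDF is log2(4/4) = 0 and it is dropped everywhere. Term 1 occurs in 3 of 4 documents, so its raw weight is about 0.415, but since it is the only surviving term in each document, normalizing to unit length turns it into 1.0. That accounts for both the missing term and the `1.0` values in the output above.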

Answered 2025-04-17 by Python大师
