Computing TF-IDF with gensim

acceptance   [ 0    0.4  0    0.3  0.7  0
information    0    0.7  0    0.5  0    0
media          0.5  0    0.4  0    0    1
model          0    0    0.6  0.5  0    0
selection      0.8  0    0.6  0    0    0
technology     0    0.4  0    0.3  0.7  0 ]

1 answer

User

#1 · Posted on 2024-05-26 11:54:12

As you mentioned, the results differ because the literature contains many ways of computing TF-IDF. The Wikipedia TF-IDF page gives the formula as

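$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

Both tf(t, d) and idf(t, D) can be computed with different functions, and the choice of function changes the final TF-IDF values. In practice, different applications use different functions.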
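To make the point concrete, here is a minimal sketch in plain Python showing how two commonly listed tf/idf variants give different weights for the same term in the same document. The toy corpus and all function names (`tf_raw`, `tf_log`, `idf_plain`, `idf_smooth`, `tfidf`) are made up for illustration and do not come from the question or from any library.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["media", "selection", "media"],
    ["information", "technology", "acceptance"],
    ["media", "model", "selection"],
]

def doc_freq(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for d in docs if term in d)

# Two possible term-frequency functions.
def tf_raw(term, doc):
    return doc.count(term)              # raw count

def tf_log(term, doc):
    return math.log1p(doc.count(term))  # log-scaled count, ln(1 + count)

# Two possible inverse-document-frequency functions.
def idf_plain(term, docs):
    return math.log(len(docs) / doc_freq(term, docs))

def idf_smooth(term, docs):
    return math.log(len(docs) / (1 + doc_freq(term, docs))) + 1

def tfidf(term, doc, docs, tf, idf):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D) for the chosen tf and idf."""
    return tf(term, doc) * idf(term, docs)

a = tfidf("media", docs[0], docs, tf_raw, idf_plain)
b = tfidf("media", docs[0], docs, tf_log, idf_smooth)
print(a, b)  # same term, same document, two different weights
```

Neither value is "the" TF-IDF weight; each is correct for its own pair of functions, which is why two tools or papers can disagree on the same corpus.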

The Gensim TF-IDF Model accepts any function for tf(t, d) and idf(t, D), as its documentation states:

Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. Formula for unnormalized weight of term i in document j in a corpus of D documents:
weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})
or, more generally:
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
so you can plug in your own custom wlocal and wglobal functions.
Default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.
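The quoted formula can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not gensim's actual code: the helper `tfidf_vector` and the toy counts are hypothetical, and only the parameter names `wlocal` and `wglobal` and their defaults are taken from the documentation quoted above.

```python
import math

def tfidf_vector(term_counts, doc_freqs, total_docs,
                 wlocal=lambda tf: tf,                     # default: identity
                 wglobal=lambda df, D: math.log2(D / df),  # default: log_2(D / df)
                 normalize=True):
    """weight = wlocal(frequency) * wglobal(document_freq, D),
    optionally scaled so the document vector has unit length."""
    weights = {
        term: wlocal(tf) * wglobal(doc_freqs[term], total_docs)
        for term, tf in term_counts.items()
    }
    if normalize:
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
    return weights

# Hypothetical numbers: a corpus of 4 documents and one document's term counts.
doc_freqs = {"media": 2, "selection": 2, "model": 1}
doc = {"media": 2, "model": 1}

default_weights = tfidf_vector(doc, doc_freqs, total_docs=4)
sqrt_weights = tfidf_vector(doc, doc_freqs, total_docs=4, wlocal=math.sqrt)
print(default_weights)
print(sqrt_weights)
```

Swapping only `wlocal` from identity to `math.sqrt` already changes the relative weights of the two terms. In gensim itself, the corresponding functions are passed to `TfidfModel` through its `wlocal` and `wglobal` arguments.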

So, if you want to reproduce the paper's results exactly, you need to know which functions it used to compute its TF-IDF matrix.

There is also a good example in the Gensim Google group that shows how to compute TF-IDF with custom functions.
