<p>In sklearn's implementation, there are two things you might not expect:</p>
<ol>
<li><code>TfidfTransformer</code> has <code>smooth_idf=True</code> as its default parameter</li>
<li>it always adds 1 to the idf weight</li>
</ol>
<p>So it uses:</p>
<pre><code>idf = log((1 + n_samples) / (1 + document_frequency)) + 1
</code></pre>
<p>Here is the source code:</p>
<p><a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992" rel="noreferrer">https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992</a></p>
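<p>You can check this formula directly against the <code>idf_</code> attribute that <code>TfidfTransformer</code> learns. A minimal sketch, using a toy count matrix I made up for illustration:</p>
<pre><code>import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: 3 documents x 2 terms
counts = np.array([[3, 0],
                   [2, 1],
                   [0, 2]])

transformer = TfidfTransformer(smooth_idf=True)  # the default
transformer.fit(counts)

# Recompute idf by hand with the smoothed formula
n_samples = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency per term
idf = np.log((1 + n_samples) / (1 + df)) + 1

print(np.allclose(transformer.idf_, idf))  # True
</code></pre>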
<p>Edit:
You can subclass the standard <code>TfidfVectorizer</code> class as follows:</p>
<pre><code>import scipy.sparse as sp
import numpy as np
from sklearn.feature_extraction.text import (TfidfVectorizer,
                                             _document_frequency)


class PriscillasTfidfVectorizer(TfidfVectorizer):

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)

        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            ####### + 1 is commented out ##########################
            idf = np.log(float(n_samples) / df)  # + 1.0
            #######################################################
            self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
                                        n=n_features)

        return self
</code></pre>
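<p>To see what dropping the trailing <code>+ 1</code> actually changes, the two idf variants can be compared side by side with plain NumPy (a standalone sketch that mirrors the subclass's math without touching sklearn's private <code>_document_frequency</code> helper):</p>
<pre><code>import numpy as np

# Toy term-count matrix: 3 documents x 2 terms (made up for illustration)
counts = np.array([[3, 0],
                   [2, 1],
                   [0, 2]])

n_samples = counts.shape[0]
df = (counts > 0).sum(axis=0)   # document frequency per term

# Standard sklearn idf with smooth_idf=True: log((1+n) / (1+df)) + 1
idf_default = np.log((1 + n_samples) / (1 + df)) + 1

# The subclass keeps the smoothing but drops the "+ 1", so a term that
# appears in every document gets an idf of exactly 0 (fully suppressed)
idf_modified = np.log((1 + n_samples) / (1 + df))

print(idf_default)
print(idf_modified)
</code></pre>
<p>The two vectors differ by exactly 1 in every component, which is why the default keeps every term's weight strictly positive while the modified version can zero terms out.</p>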