How to see the top n entries of a term-document matrix after tfidf in scikit-learn
I am new to scikit-learn, and I am using TfidfVectorizer to compute the tfidf values of terms in a set of documents. I used the following code to obtain them.
vectorizer = TfidfVectorizer(stop_words=u'english',ngram_range=(1,5),lowercase=True)
X = vectorizer.fit_transform(lectures)
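Now if I print X, I can see all the entries in the matrix, but how can I find the top n entries by tfidf score? In addition, is there any method that would help me find the top n entries by tfidf score for each n-gram size, i.e. the top entries among unigrams, bigrams, trigrams and so on?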
1 Answer
65
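Since version 0.15, the global term weighting of the features learnt by a TfidfVectorizer can be accessed through the attribute idf_ (the inverse document frequency of each feature), which returns an array whose length equals the feature dimension. Sort the features by this weighting to get the top weighted features: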
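from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)
# Feature indices ordered by idf weight, highest first (stable sort keeps ties reproducible)
indices = np.argsort(vectorizer.idf_, kind="stable")[::-1]
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
features = vectorizer.get_feature_names_out()
top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print(top_features)
Output:
['food', 'drink']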
To get the top features by n-gram, the same idea works, with some extra steps to split the features into different groups:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(lectures)
# Group (feature, idf weight) pairs by the number of words in the n-gram
features_by_gram = defaultdict(list)
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))
top_n = 2
for gram, features in features_by_gram.items():
    # sorted() is stable, so features with equal weight keep their alphabetical order
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)
Output:
1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']