Creating a tf-idf value matrix
I have a set of documents like this:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
and a set of words like this:
"sky","land","sea","water","sun","moon"
I want to create a matrix like this:
x D1 D2 D3
sky tf-idf 0 tf-idf
land 0 0 0
sea 0 0 0
water 0 0 0
sun 0 tf-idf tf-idf
moon 0 0 0
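This is similar to the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. That link uses the words that occur in the documents themselves, but I need to use the set of words listed above.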
If a given word occurs in a document, I put its tf-idf value in the matrix; if it does not occur, I put 0.
Is there a way to build such a matrix? Python is preferred, but R is fine too.
I am using the following code, but I am not sure whether I am doing the right thing. My code is:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
#print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
The result I get is very strange: it contains only 0 and 1, whereas I expected values between 0 and 1.
[[ 0. 0. 1. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 0. 0.]
[ 1. 0. 0. 0.]]
I am also open to any other library for computing tf-idf. I just want a correct matrix like the one described above.
2 Answers
1
I think what you want is
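from sklearn.feature_extraction.text import TfidfVectorizer
# stopWords, test_set and train_set are the variables defined in the question
vectorizer = TfidfVectorizer(stop_words=stopWords, vocabulary=test_set)
matrix = vectorizer.fit_transform(train_set)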
(As I said before, this is not a test set, it is a vocabulary.)
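To make the suggestion above concrete, here is a minimal self-contained sketch. It assumes scikit-learn's default settings (smoothed idf and L2 row normalization) and leaves out the stop-word list, since a fixed vocabulary already restricts which words are counted. The names `documents`, `vocabulary`, `doc_term` and `term_doc` are illustrative, not from the question. The result is transposed so that rows are words and columns are documents, as in the desired table.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
]
vocabulary = ["sky", "land", "sea", "water", "sun", "moon"]

# Passing a list as `vocabulary` fixes the columns to exactly these words,
# in this order; words outside the list are ignored.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)

# fit_transform returns a sparse documents x words matrix, shape (3, 6).
doc_term = vectorizer.fit_transform(documents)

# Transpose to words x documents, shape (6, 3), to match the desired layout.
term_doc = doc_term.T.toarray()

for word, row in zip(vocabulary, term_doc):
    print(word, row)
```

Note that with the default `norm='l2'` each document vector is scaled to unit length, so a document containing only one vocabulary word (D1 or D2 here) still gets exactly 1.0 for that word. That is probably also why the question's code printed only 0 and 1: each single-word "document" in `test_set` was normalized to unit length. Passing `norm=None` to `TfidfVectorizer` should give the unnormalized values if those are preferred.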
2
A solution in R might look like this:
library(tm)
docs <- c(D1 = "The sky is blue.",
          D2 = "The sun is bright.",
          D3 = "The sun in the sky is bright.")
dict <- c("sky","land","sea","water","sun","moon")
mat <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                          control=list(weighting = weightTfIdf,
                                       dictionary = dict))
as.matrix(mat)[dict, ]
# Docs
# Terms D1 D2 D3
# sky 0.5849625 0.0000000 0.2924813
# land 0.0000000 0.0000000 0.0000000
# sea 0.0000000 0.0000000 0.0000000
# water 0.0000000 0.0000000 0.0000000
# sun 0.0000000 0.5849625 0.2924813
# moon 0.0000000 0.0000000 0.0000000
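For readers who want the same numbers in Python without any library, the following is a rough sketch that reproduces the table above. It rests on an assumption inferred from the printed values rather than taken from the tm source: term frequency is the word count divided by the number of dictionary words in the document, and idf is log2(N / df). The helper names `tokenize` and `tfidf_matrix` are hypothetical.

```python
import math
import re

docs = {
    "D1": "The sky is blue.",
    "D2": "The sun is bright.",
    "D3": "The sun in the sky is bright.",
}
dictionary = ["sky", "land", "sea", "water", "sun", "moon"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs, dictionary):
    """Return (document names, {word: [tf-idf per document]})."""
    names = list(docs)
    # Count only dictionary words in each document.
    counts = {
        name: {w: tokenize(text).count(w) for w in dictionary}
        for name, text in docs.items()
    }
    n_docs = len(docs)
    matrix = {}
    for w in dictionary:
        # Document frequency: number of documents containing the word.
        df = sum(1 for name in names if counts[name][w] > 0)
        # Assumed idf variant: log base 2 of N / df; 0 for unseen words.
        idf = math.log2(n_docs / df) if df else 0.0
        row = []
        for name in names:
            total = sum(counts[name].values())
            # Assumed tf variant: count normalized by dictionary-word total.
            tf = counts[name][w] / total if total else 0.0
            row.append(tf * idf)
        matrix[w] = row
    return names, matrix


names, matrix = tfidf_matrix(docs, dictionary)
print("word", *names)
for word in dictionary:
    print(word, *(f"{value:.7f}" for value in matrix[word]))
```

Words that never occur, such as "land", end up as all-zero rows, which matches the fill rule described in the question.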