Creating a matrix of tf-idf values

Asked 2025-04-18 08:18

I have a set of documents like this:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

and a set of words like this:

"sky","land","sea","water","sun","moon"

I want to create a matrix like this:

   x        D1           D2         D3
sky         tf-idf       0          tf-idf
land        0            0          0
sea         0            0          0
water       0            0          0
sun         0            tf-idf     tf-idf
moon        0            0          0

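It should look like the example table given here: http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. In that link the words are taken from the documents themselves, but I need to use the set of words I listed above.

If a given word occurs in a document, I fill in its tf-idf value in the matrix; if it does not, I fill in 0.

Is there a way to build such a matrix? Python is preferred, but R is fine too.

I am using the following code, but I am not sure I am doing the right thing. My code is: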

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords


train_set = "The sky is blue.", "The sun is bright.", "The sun in the sky is bright." #Documents
test_set = ["sky","land","sea","water","sun","moon"] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
#print 'Fit Vectorizer to train set', trainVectorizerArray
#print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
#print
#print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
#print 
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

The result I get is very strange: only 0s and 1s, whereas I expected values between 0 and 1.

[[ 0.  0.  1.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]]

I am also open to trying other libraries to compute tf-idf. I just want a correct matrix like the one I described above.


2 Answers

I think what you want is

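from sklearn.feature_extraction.text import TfidfVectorizer

# Restrict the columns to the given word list instead of learning them from the documents
vectorizer = TfidfVectorizer(stop_words=stopWords, vocabulary=test_set)
matrix = vectorizer.fit_transform(train_set)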

(As I said before, that is not a test set; it is a vocabulary.)
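
For reference, here is a self-contained sketch of the approach above, assuming scikit-learn's default TfidfVectorizer settings (smoothed idf, L2 normalization of each document row). The names docs, vocabulary, doc_term and term_doc are illustrative and not from the original code. The transpose is only there to get words as rows and documents as columns, as in the table in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The sky is blue.",               # D1
    "The sun is bright.",             # D2
    "The sun in the sky is bright.",  # D3
]
vocabulary = ["sky", "land", "sea", "water", "sun", "moon"]

# A fixed vocabulary pins the columns to exactly these words, in this order;
# words outside the list (including stop words) are simply ignored.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
doc_term = vectorizer.fit_transform(docs)   # sparse, shape (3 documents, 6 words)

# Transpose so rows are words and columns are documents.
term_doc = doc_term.T.toarray()             # dense, shape (6 words, 3 documents)

print(f"{'x':<6} {'D1':>8} {'D2':>8} {'D3':>8}")
for word, row in zip(vocabulary, term_doc):
    print(f"{word:<6} " + " ".join(f"{value:8.4f}" for value in row))
```

With these defaults each document vector is L2-normalized over the vocabulary words only, so D1 and D2, which each contain a single vocabulary word, get 1.0 for that word, and D3 gets about 0.7071 for both sky and sun. Passing norm=None to TfidfVectorizer should give unnormalized weights instead. The same normalization likely explains the 0/1 output in the question: each "document" in test_set is a single word, so a row has at most one nonzero entry, and that entry normalizes to 1.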

Answered 2025-04-18 by Python大师


A solution in R could look like this:

library(tm)
docs <- c(D1 = "The sky is blue.",
          D2 = "The sun is bright.",
          D3 = "The sun in the sky is bright.")
dict <- c("sky","land","sea","water","sun","moon")
mat <- TermDocumentMatrix(Corpus(VectorSource(docs)), 
                          control=list(weighting =  weightTfIdf, 
                                       dictionary = dict))
as.matrix(mat)[dict, ]
#         Docs
# Terms          D1        D2        D3
#   sky   0.5849625 0.0000000 0.2924813
#   land  0.0000000 0.0000000 0.0000000
#   sea   0.0000000 0.0000000 0.0000000
#   water 0.0000000 0.0000000 0.0000000
#   sun   0.0000000 0.5849625 0.2924813
#   moon  0.0000000 0.0000000 0.0000000
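
To make the numbers above easier to check, here is a small Python sketch that recomputes them by hand. It assumes the weighting is (term count / total count of dictionary terms in the document) × log2(number of documents / number of documents containing the term). That formula is inferred from the printed values, so treat it as an assumption rather than a statement of how tm works internally. The tokenize helper and all variable names are hypothetical.

```python
import math

docs = {
    "D1": "The sky is blue.",
    "D2": "The sun is bright.",
    "D3": "The sun in the sky is bright.",
}
dictionary = ["sky", "land", "sea", "water", "sun", "moon"]

def tokenize(text):
    """Lowercase and split on whitespace, stripping trailing punctuation."""
    return [word.strip(".,").lower() for word in text.split()]

# Raw counts per document, restricted to the dictionary terms.
counts = {
    name: [tokenize(text).count(term) for term in dictionary]
    for name, text in docs.items()
}

n_docs = len(docs)
# Document frequency: in how many documents each dictionary term occurs.
doc_freq = [
    sum(1 for c in counts.values() if c[i] > 0)
    for i in range(len(dictionary))
]

tfidf = {}
for name, c in counts.items():
    total = sum(c)  # number of dictionary-term occurrences in this document
    tfidf[name] = [
        (c[i] / total) * math.log2(n_docs / doc_freq[i]) if c[i] else 0.0
        for i in range(len(dictionary))
    ]

for i, term in enumerate(dictionary):
    print(f"{term:<6}", " ".join(f"{tfidf[name][i]:.7f}" for name in docs))
```

For sky in D1 this gives (1/1) × log2(3/2) ≈ 0.5849625, and for sky in D3 it gives (1/2) × log2(3/2) ≈ 0.2924813, which agrees with the R output.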

Answered 2025-04-18 by Python大师
