How does the vectorizer's fit_transform work in sklearn?

Posted 2024-04-20 05:16:12


I am trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

When I print X to see what it returns, I get the following result:

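(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1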

However, I don't understand what this result means.


3 answers

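It converts text into numbers. With other functions you can then count how many times each word occurs in a given dataset. I'm new to programming, so there may be other areas where it can be used.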
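To make "converts text into numbers" concrete, here is a rough pure-Python sketch of the two steps that fit_transform combines. It is only an approximation of CountVectorizer under its default settings (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary), not the library's actual implementation; the names `tokenize`, `vocabulary` and `entries` are made up for illustration.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Assumption: mimic the default tokenization by lowercasing the text and
# keeping runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


# "fit": learn the vocabulary, i.e. map every distinct word to a column
# index, assigned in alphabetical order.
all_words = sorted({word for doc in corpus for word in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

# "transform": count each word per document and keep only nonzero counts,
# keyed by (document index, feature index).
entries = {}
for doc_idx, doc in enumerate(corpus):
    for word, count in Counter(tokenize(doc)).items():
        entries[(doc_idx, vocabulary[word])] = count

print(vocabulary)
print(entries[(1, vocabulary["second"])])
```

The `entries` dictionary plays the role of the sparse matrix X: each key is a (document, feature) pair and each value is a count.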

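As @Himanshu wrote, each line has the form "(sentence index, feature index) count".

Here, the count is the number of times the word appears in that document.

For example: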

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 2 For this example, the count "2" tells you that the word "second" appears twice in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1
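One way to check the listing above is to turn every stored entry back into a word. The sketch below assumes the same corpus as the question and inverts `vectorizer.vocabulary_` (which maps word to feature index); `index_to_word` and `triples` are illustrative names, not part of sklearn.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index; invert it to look words up by index.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# The COO format exposes the stored entries as parallel row/col/data arrays.
coo = X.tocoo()
triples = sorted(
    (int(row), index_to_word[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
)

for row, word, count in triples:
    print(f"document {row}: {word!r} x {count}")
```

The order in which print(X) lists the entries can differ between versions, which is why the triples are sorted here before printing.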

Let's change the corpus in the code. Basically, I added "second" two more times to the second sentence of the corpus list.

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 4 For the modified corpus, the count "4" tells you that the word "second" appears four times in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1
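The comparison between the two corpora can also be checked directly, without reading the whole listing. This is a small sketch; the helper `count_of` is hypothetical and simply refits a fresh CountVectorizer on whichever corpus it is given.

```python
from sklearn.feature_extraction.text import CountVectorizer

original = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]
modified = [
    "This is the first document.",
    "This is the second second second second document.",
    "And the third one.",
    "Is this the first document?",
]


def count_of(word, doc_idx, corpus):
    """Return how often `word` occurs in document `doc_idx` of `corpus`."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # vocabulary_ gives the column for the word; X[row, col] is its count.
    return int(X[doc_idx, vectorizer.vocabulary_[word]])


print(count_of("second", 1, original))  # 2
print(count_of("second", 1, modified))  # 4
```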

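You can read it as "(sentence index, feature index) count".

There are four sentences, so the sentence index runs from 0 to 3.

The feature index is the index of a word, which you can look up in vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature index, ...}

So for the example (0, 1) 1: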

-> 0 : row[the sentence index]

-> 1 : column [the feature index]; the word is the key whose value in vectorizer.vocabulary_ is 1 (here "document")

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you count)
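The reading of (0, 1) 1 above can be reproduced in a few lines. This is a sketch assuming the corpus from the question; `index_to_word` is an illustrative helper built by inverting `vocabulary_`. In scikit-learn 1.0 and later, `vectorizer.get_feature_names_out()` should give the same index-to-word lookup as an array.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Four documents (rows) and nine distinct words (columns).
print(X.shape)

# vocabulary_ is keyed by word, so look up the index from the word...
print(vectorizer.vocabulary_["document"])

# ...and invert the dictionary to go from a feature index back to the word.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

row, col = 0, 1
word = index_to_word[col]
count = int(X[row, col])
print(f"sentence {row} contains {word!r} {count} time(s)")

# Dense view of the first sentence: one count per vocabulary word.
first_row = X.toarray()[0].tolist()
print(first_row)
```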

If you use a tfidf vectorizer instead of a count vectorizer, it will give you tfidf values. I hope that is clear.
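As a rough illustration of that last point, the sketch below fits both vectorizers on the question's corpus with their default settings. Under those defaults the nonzero positions should be the same, but the stored values become floating-point weights and each row is scaled to unit Euclidean length. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

counts = CountVectorizer().fit_transform(corpus)

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus)

# Same (sentence index, feature index) positions, different stored values.
count_positions = {(int(i), int(j)) for i, j in zip(*counts.nonzero())}
tfidf_positions = {(int(i), int(j)) for i, j in zip(*tfidf.nonzero())}
print(count_positions == tfidf_positions)

# Each row has unit length under the default normalization.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)

# In sentence 1, the rare and repeated word "second" outweighs the common "the".
second_idx = tfidf_vectorizer.vocabulary_["second"]
the_idx = tfidf_vectorizer.vocabulary_["the"]
print(tfidf[1, second_idx], tfidf[1, the_idx])
```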
