How does the vectorizer's fit_transform work in sklearn?

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)

3 Answers

User

#1 · Edited 2024-04-20 05:16:12

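It converts text into numbers. With other functions you can then count how many times each word occurs in a given dataset. I'm new to programming, so there may be other uses as well.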

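User

#2 · Edited 2024-04-20 05:16:12

As @Himanshu wrote, each entry reads as "(sentence index, feature index) count".

Here, the count is the number of times the word appears in the document.

For example: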

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 2  <- in this example, the count "2" means the word "second" appears twice in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1
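
The triples above can be reproduced by hand. The following is a minimal pure-Python sketch of the idea, not scikit-learn's actual implementation. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, feature indices assigned in alphabetical order); the names `tokenize`, `vocabulary` and `counts` are illustrative only, and the entries may print in a different order than sklearn shows them.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Assumption: mimic the default tokenization (lowercase, tokens of 2+ word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# "fit": learn the vocabulary, assigning feature indices in alphabetical order.
all_words = sorted({word for doc in corpus for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(all_words)}

# "transform": count each word per document, keyed by (sentence index, feature index).
counts = {}
for row, doc in enumerate(corpus):
    for word, n in Counter(tokenize(doc)).items():
        counts[(row, vocabulary[word])] = n

# Print every non-zero entry together with the word it stands for.
for (row, col), n in sorted(counts.items()):
    print(f"({row}, {col}) {n}  # {all_words[col]!r}")
```

Only non-zero counts are stored, which is why a word that does not occur in a sentence has no entry for that row.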

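Let's change the corpus in the code. Basically, I added "second" two more times to the second sentence in the corpus list, so it now appears four times.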

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)

# Print the sparse matrix as (sentence index, feature index) count entries
print(X)

(0, 1) 1
(0, 2) 1
(0, 6) 1
(0, 3) 1
(0, 8) 1
(1, 5) 4  <- for the modified corpus, the count "4" means the word "second" appears four times in this document/sentence
(1, 1) 1
(1, 6) 1
(1, 3) 1
(1, 8) 1
(2, 4) 1
(2, 7) 1
(2, 0) 1
(2, 6) 1
(3, 1) 1
(3, 2) 1
(3, 6) 1
(3, 3) 1
(3, 8) 1

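User

#3 · Edited 2024-04-20 05:16:12

You can read each entry as "(sentence index, feature index) count".

Since there are four sentences, the sentence index starts at 0 and ends at 3.

The feature index is the index of the word, which you can look up in vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature index, ...}

So for the example (0, 1) 1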

-> 0 : row[the sentence index]

-> 1 : the feature index; the word is the key in vectorizer.vocabulary_ whose value is 1 (here "document")

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you count)
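
As a rough sketch of the lookup described above: vocabulary_ maps each word to its feature index, so going from an index back to a word means inverting the dictionary first. The helper name `index_to_word` is just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ is {word: feature index}; invert it to go from index to word.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Decode the entry "(0, 1) 1": row 0 is the sentence, column 1 is the feature.
row, col = 0, 1
print(f"sentence {row}: {corpus[row]!r}")
print(f"feature {col}: {index_to_word[col]!r}")
print(f"count: {X[row, col]}")

# The dense view shows the same counts as a (sentences x features) table.
print(X.toarray())
```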

If you use a TF-IDF vectorizer (TfidfVectorizer) instead of a count vectorizer, it will give you TF-IDF values instead of counts. I hope that is clear.
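
For comparison, here is a small sketch of the same corpus run through TfidfVectorizer with its default settings. The variable names `tfidf` and `X_tfidf` are illustrative. The non-zero entries sit at the same (sentence index, feature index) positions as before, but each value is a floating-point weight rather than an integer count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Same shape as the count matrix: 4 sentences by 9 vocabulary words.
print(X_tfidf.shape)

# Weight of "second" in the second sentence (a float, not the raw count 2).
second_col = tfidf.vocabulary_["second"]
print(X_tfidf[1, second_col])
```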
