Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

43 votes

3 answers

86163 views

Asked 2025-04-18 01:45

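I have been working with the CountVectorizer class in scikit-learn.

I understand that, when used as shown below, the final output is an array containing the counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.: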

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer
# `tokenize` is a user-defined function (not shown here) that splits each string into tags
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)

which gives us:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]
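
The question does not show `tokenize`. A minimal sketch that reproduces the matrix above, assuming the helper simply splits on commas and strips whitespace (that definition is a guess, not the original code):

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Assumed helper: split a comma-separated tag string and trim whitespace.
    return [tok.strip() for tok in text.split(",")]


tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()

# Feature names ordered by their column index in `data`.
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)
print(data)
```

With this tokenizer, "distributed systems" stays a single feature, which is why the matrix has six columns rather than seven.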

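This is fine, but my situation is slightly different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts for another set of documents, say: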

list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

and get:

[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

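I have read the documentation for the CountVectorizer class and came across the vocabulary argument, which is a mapping of terms to feature indices. However, I can't seem to get this argument to help me.

Any advice is appreciated.
PS: all credit goes to Matthias Friedrich's blog, which is where the example above comes from.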

Tags: machine-learning, document-processing, scikit-learn, keyword-extraction, feature-extraction, countvectorizer, document-frequency, term-index

3 Answers

>>> tags = [
...   "python, tools",
...   "linux, tools, ubuntu",
...   "distributed systems, linux, networking, tools",
... ]

>>> list_of_new_documents = [
...   ["python, chicken"],
...   ["linux, cow, ubuntu"],
...   ["machine learning, bird, fish, pig"]
... ]

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer()
>>> tags = vect.fit_transform(tags)

# vocabulary learned by CountVectorizer (vect)
>>> print(vect.vocabulary_)
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}

# counts for tags
>>> tags.toarray()
array([[0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 1, 0]], dtype=int64)

# to use `transform`, `list_of_new_documents` must be a flat list of strings
# `itertools.chain.from_iterable` flattens one level of nesting

>>> from itertools import chain
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs)

# finally, counts for new_docs!
>>> new_docs.toarray()
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]])

To confirm that CountVectorizer applied the vocabulary learned from tags to new_docs, print vect.vocabulary_ again, or compare the output of new_docs.toarray() with that of tags.toarray(): both have the same seven columns.
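
The check described above can be sketched as follows. The variable names `tag_counts`, `new_counts`, and `vocab_before` are introduced here for illustration only; unlike the session above, the fitted matrices are not assigned back to `tags` and `new_docs`.

```python
from itertools import chain

from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
list_of_new_documents = [
    ["python, chicken"],
    ["linux, cow, ubuntu"],
    ["machine learning, bird, fish, pig"],
]

vect = CountVectorizer()
tag_counts = vect.fit_transform(tags)
vocab_before = dict(vect.vocabulary_)  # snapshot of the learned vocabulary

new_docs = list(chain.from_iterable(list_of_new_documents))
new_counts = vect.transform(new_docs)

# transform() must not change the vocabulary or the number of columns.
print(vect.vocabulary_ == vocab_before)
print(tag_counts.shape[1] == new_counts.shape[1])
# Words that appear only in the new documents are ignored.
print("chicken" in vect.vocabulary_)
```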

Answered 2025-04-18 by Python大师


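You should call fit_transform, or just fit, on your original vocabulary source so that the vectorizer learns a vocabulary.

You can then apply this fitted vectorizer to any new data source via the transform() method.

You can obtain the vocabulary produced by fit (i.e. the mapping from each word to its token ID) via vectorizer.vocabulary_, assuming you named your CountVectorizer vectorizer.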
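
A short sketch of that workflow, extended to the document frequency mentioned in the question title. The two small document lists are made up for illustration, and `doc_freq` is a name introduced here, not part of scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data: the vocabulary comes from `source_docs` only.
source_docs = ["python, tools", "linux, tools, ubuntu"]
new_docs = ["python, chicken", "linux, cow, ubuntu", "linux, python"]

vectorizer = CountVectorizer()
vectorizer.fit(source_docs)  # learn the vocabulary, nothing else
counts = vectorizer.transform(new_docs).toarray()

# Document frequency: how many new documents contain each term at least once.
doc_freq = (counts > 0).sum(axis=0)

for term, idx in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(term, doc_freq[idx])
```

Calling `toarray()` is fine for a toy example; for a large corpus it is probably better to keep the result sparse.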

Answered 2025-04-18 by Python大师


You're right that vocabulary is what you want. It works like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

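You pass it the features you want, either as an iterable of terms (as above) or as a dict whose keys are the terms and whose values are the column indices.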
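
As a rough illustration of the `vocabulary` argument, the sketch below passes the same three terms once as a list and once as a dict and checks that the counts agree. The names `cv_list` and `cv_dict` are made up for this example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pease porridge hot", "pease porridge cold", "nine days old"]

# An iterable of terms: column order follows the order given.
cv_list = CountVectorizer(vocabulary=["hot", "cold", "old"])
# An explicit mapping from term to column index.
cv_dict = CountVectorizer(vocabulary={"hot": 0, "cold": 1, "old": 2})

a = cv_list.fit_transform(docs).toarray()
b = cv_dict.fit_transform(docs).toarray()

print(a)
print((a == b).all())
```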

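If you have used CountVectorizer on one set of documents and then want to reuse the features from those documents on a new set, take the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. In your example, you could do: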

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

This creates a new vectorizer that uses the vocabulary of your first one.
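
Applied to the data from the question, this might look like the sketch below. It assumes a comma-splitting `tokenize` helper (not shown in the question, so this definition is a guess) and passes the same tokenizer to the second vectorizer, since only the vocabulary is carried over:

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Assumed helper: split a comma-separated tag string and trim whitespace.
    return [tok.strip() for tok in text.split(",")]


tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# The vocabulary is fixed, so no fitting on the new documents is needed.
new_vec = CountVectorizer(vocabulary=vec.vocabulary_, tokenizer=tokenize)
new_counts = new_vec.transform(new_documents).toarray()
print(new_counts)
```

If the second vectorizer used the default tokenizer instead, the multi-word feature "distributed systems" could never match, because the default splits on word boundaries.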

Answered 2025-04-18 by Python大师
