Understanding the ngram_range argument of CountVectorizer in sklearn

Posted 2024-05-12 20:01:56

I'm a little confused about how to use n-grams in the scikit-learn library for Python, specifically how the ngram_range argument works in a CountVectorizer.

Running this code:

from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
# Note: recent scikit-learn releases may only populate vocabulary_ after fit() or transform().
print(cv.vocabulary_)

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2}

I was under the (evidently mistaken) impression that I would get both unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}

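I am working from the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Clearly there is something badly wrong with my understanding of how to use n-grams. Perhaps the argument has no effect, or I have some conceptual misunderstanding of what a bigram actually is! I'm stumped. If anyone has a word of advice, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that ngram_range would affect the vocabulary, not the corpus.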


1 answer

Answer 1 · Posted 2024-05-12 20:01:56

Setting the vocabulary explicitly means that no vocabulary is learned from the data. If you don't set it, you get:

>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{'an': 0,
 'an apple': 1,
 'apple': 2,
 'apple day': 3,
 'away': 4,
 'day': 5,
 'day keeps': 6,
 'doctor': 7,
 'doctor away': 8,
 'keeps': 9,
 'keeps the': 10,
 'the': 11,
 'the doctor': 12}
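
Applied to the phrases from the question, the same idea can be sketched as follows. This is only an illustration: it assumes the three phrases are passed to fit() as documents rather than as a fixed vocabulary, and the names phrases and learned are made up for the example. The indices come out in alphabetical order, so they differ from the numbering the question expected.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The question's phrases, treated here as documents to learn from
# (an assumption for illustration), not as a fixed vocabulary.
phrases = ['hi ', 'bye', 'run away']

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(phrases)

# Copy the learned term -> column index mapping into a plain dict.
learned = dict(cv.vocabulary_)
print(sorted(learned))  # ['away', 'bye', 'hi', 'run', 'run away']
```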

An explicit vocabulary restricts the terms that will be extracted from the text; the vocabulary itself is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found
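
To see that ngram_range still matters when the vocabulary is fixed, here is a small sketch comparing two settings on the same sentence. The variable names are illustrative only, and the vocabulary is given as a list here (rather than the set used above) so that the column order is explicit.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["an apple a day keeps the doctor away"]
vocab = ["keeps", "keeps the"]  # fixed vocabulary; list order sets the column order

# Same fixed vocabulary, two different n-gram ranges.
unigrams_only = CountVectorizer(ngram_range=(1, 1), vocabulary=vocab)
with_bigrams = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)

counts_uni = unigrams_only.fit_transform(doc).toarray().tolist()
counts_bi = with_bigrams.fit_transform(doc).toarray().tolist()

print(counts_uni)  # [[1, 0]]  the bigram is never extracted, so its count stays 0
print(counts_bi)   # [[1, 1]]  both the unigram and the bigram are counted
```

With ngram_range=(1, 1) the text is never split into bigrams, so the "keeps the" column exists but stays at zero.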

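(Note that token filtering is applied before n-gram extraction, hence "apple day": the default token_pattern drops single-character tokens such as "a". Stopword filtering, when enabled, is likewise applied before n-grams are built.)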
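
The point that tokens are filtered before n-grams are built can be checked with build_analyzer(), which returns the function the vectorizer uses to turn a string into terms. A minimal sketch, with sample strings and a one-word stop list chosen purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: the token pattern keeps only tokens of 2+ characters,
# so "a" is gone before bigrams are formed.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
default_terms = analyze("an apple a day")
print(default_terms)  # ['an', 'apple', 'day', 'an apple', 'apple day']

# With an explicit stop word list, "the" is removed first, and the
# bigram is then formed from the remaining neighbours.
analyze_sw = CountVectorizer(ngram_range=(1, 2), stop_words=["the"]).build_analyzer()
stopword_terms = analyze_sw("keeps the doctor")
print(stopword_terms)  # ['keeps', 'doctor', 'keeps doctor']
```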