Unicode warning from NLTK stop words when using scikit-learn's TfidfVectorizer

3 votes

1 answer

4873 views

Asked on 2025-04-18 18:10

I am trying to use the tf-idf vectorizer from scikit-learn, and I want to pass it the Spanish stop words from NLTK:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))

However, I get the following warning:

/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]

Is there a simple way to fix this?

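1 Answer

This turned out to be much simpler than I expected. The problem is that, on Python 2, NLTK returns the stop words as byte strings (str objects) rather than unicode objects, so I need to decode them from UTF-8 before using them: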

# Use a new name so the imported stopwords module is not shadowed
spanish_stop_words = [word.decode('utf-8') for word in stopwords.words('spanish')]
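For anyone who wants to see the effect in isolation, here is a minimal sketch of the same decoding step applied to a small hand-written list. The `to_unicode` helper and the sample words are my own assumptions for illustration; they are not part of NLTK or scikit-learn, and the sample list only stands in for what `stopwords.words('spanish')` returns on Python 2.

```python
# Sketch only: to_unicode is a hypothetical helper, not an NLTK or scikit-learn API.

def to_unicode(words, encoding="utf-8"):
    """Decode byte strings to text and leave text strings untouched."""
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]


# Assumed sample: UTF-8 encoded byte strings standing in for the NLTK stop words.
raw_stop_words = [b"tambi\xc3\xa9n", b"est\xc3\xa1", b"de"]
stop_words = to_unicode(raw_stop_words)

# Tokens are text strings, as they would be inside the vectorizer.
tokens = [u"el", u"gato", u"tambi\xe9n", u"de"]

# The same kind of membership test that triggered the warning, now text against text.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

The decoded list should then be usable as the `stop_words` argument of `TfidfVectorizer`. The helper skips anything that is already text, so as far as I know it should also be harmless on Python 3, where NLTK returns text strings.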

Answered on 2025-04-18 by Python大师
