How to remove stop words using NLTK or Python

145 votes

14 answers

267366 views

Asked 2025-04-16 14:43

I have a dataset from which I want to remove stop words, the common words that carry little meaning on their own.

I used NLTK to get a list of stop words:

from nltk.corpus import stopwords

stopwords.words('english')

How do I compare my data against this list so that the stop words are removed from the data?

text-processing natural-language-processing nltk stop-words

14 Answers

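If you want to exclude every kind of stop word, including the ones that ship with nltk, you can combine the stop_words package's list with NLTK's list: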

from stop_words import get_stop_words
from nltk.corpus import stopwords

# Merge both lists into one set so each membership test is a fast hash lookup
stop_words = set(get_stop_words('en'))         #About 900 stopwords
nltk_words = set(stopwords.words('english'))   #About 150 stopwords
stop_words.update(nltk_words)

# word_list is your already-tokenized text (a list of strings)
output = [w for w in word_list if w not in stop_words]
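One thing the snippet above does not handle is capitalization: NLTK's English stop words are all lowercase, so a token such as "The" would slip through. Here is a minimal sketch of a case-insensitive filter. The helper name `remove_stopwords` and the tiny `SAMPLE_STOP_WORDS` list are made up for illustration so the example runs without any downloads; in practice you would pass in the merged set from the answer above.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not a stop word, keeping order."""
    # Normalize the stop words once so the comparison is case-insensitive
    stop_set = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop_set]


# Hypothetical hand-written list, standing in for stopwords.words('english')
SAMPLE_STOP_WORDS = ["the", "is", "a", "on"]

tokens = ["The", "cat", "is", "on", "a", "mat"]
filtered = remove_stopwords(tokens, SAMPLE_STOP_WORDS)
print(filtered)  # ['cat', 'mat']
```

The original casing of the surviving tokens is kept; only the comparison is lowercased.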

Answered 2025-04-16 by Python大师


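You can also take a set difference. Here sentence is your input string and pattern is the regex your tokenizer splits on. Note that a set difference drops duplicate words and does not preserve word order:

import nltk
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))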
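To make the trade-off concrete, here is a small sketch comparing the set difference with a list comprehension on the same tokens. The token list and the two-word stop set are invented for the example and stand in for real tokenizer output and NLTK's list.

```python
# Hypothetical sample data, standing in for tokenizer output and NLTK's stop words
tokens = ["the", "quick", "fox", "and", "the", "quick", "dog"]
stop_set = {"the", "and"}

# Set difference: unique words only, in no guaranteed order
unique_words = set(tokens) - stop_set

# List comprehension: keeps duplicates and the original order
ordered_words = [t for t in tokens if t not in stop_set]

print(sorted(unique_words))  # ['dog', 'fox', 'quick']
print(ordered_words)         # ['quick', 'fox', 'quick', 'dog']
```

The set difference is fine when you only need the vocabulary; use the list comprehension when word counts or order matter, for example before computing frequencies.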

Answered 2025-04-16 by Python大师


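When programming, we sometimes run into problems, especially when using a particular tool or library. These can be confusing, particularly for beginners. Some error messages look complicated, but they are simply telling us where something went wrong.

If a feature does not behave as expected, the error message is a good place to start. It usually points to a specific line of code and says what happened there. Understanding that information is the key to fixing the problem.

Solutions can also often be found online, for example on forums such as StackOverflow, where experienced programmers share their knowledge and help beginners.

In short, do not be discouraged when you hit a problem: read the error message carefully, look for a solution, and you will gradually become more proficient.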

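from nltk.corpus import stopwords
# ...
# Build the set once; calling stopwords.words() inside the comprehension would rebuild the list for every word
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in word_list if word not in english_stopwords]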
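If your data is raw text and not yet a word list, you need a tokenization step first. The following is a rough sketch of the whole pipeline under some assumptions: the regex tokenizer is a simple stand-in (NLTK provides more careful tokenizers), and the names `tokenize`, `filter_text`, and `SAMPLE_STOP_WORDS` are made up for this example so it runs without NLTK data installed.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def filter_text(text, stop_words):
    """Tokenize text and drop every token found in stop_words."""
    stop_set = set(stop_words)  # built once, then reused for every token
    return [tok for tok in tokenize(text) if tok not in stop_set]


# Hypothetical short list; swap in stopwords.words('english') for real use
SAMPLE_STOP_WORDS = {"this", "is", "a", "off"}

result = filter_text(
    "This is a sample sentence, showing off stop-word filtering.",
    SAMPLE_STOP_WORDS,
)
print(result)
# ['sample', 'sentence', 'showing', 'stop', 'word', 'filtering']
```

Note that this simple regex splits "stop-word" into two tokens; whether that is what you want depends on your data.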

Answered 2025-04-16 by Python大师
