Removing stopwords and punctuation

10 votes

3 answers

39640 views

Asked 2025-04-16 15:04

I'm having some trouble with the NLTK stopwords.

Here is my code. Can someone tell me what's wrong with it?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''

text-processing nltk stopwords

3 Answers

Another option, using a more modern module (2020):

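from nltk.corpus import stopwords
from textblob import TextBlob

def removeStopwords(texto):
    # Build the stopword set once instead of reloading the list for every word
    spanish_stopwords = set(stopwords.words('spanish'))
    # TextBlob(...).words tokenizes the text and leaves out punctuation
    words = TextBlob(texto).words
    # Compare in lowercase, because the NLTK stopword lists are lowercase
    return ' '.join(word for word in words if word.lower() not in spanish_stopwords)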

Answered 2025-04-16 by Python大师


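First, if you use a tokenizer, you can compare a list of tokens against the stopword list directly, so you don't need the re module. I also added an extra parameter so you can switch between languages.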

import nltk
from nltk.corpus import stopwords

def remove_stopwords(sentence, language):
    return [token for token in nltk.word_tokenize(sentence) if token.lower() not in stopwords.words(language)]
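One thing to keep in mind: `nltk.word_tokenize` returns punctuation marks as separate tokens, so they survive a filter that only checks for stopwords. Below is a minimal sketch of dropping them as well, working on an already tokenized list. The function name `drop_stopwords_and_punctuation`, the sample tokens, and the sample stopword list are all made up for this illustration and are not part of NLTK.

```python
import string

def drop_stopwords_and_punctuation(tokens, stop_words):
    """Keep tokens that are neither stopwords nor made up purely of punctuation."""
    stop_set = {w.lower() for w in stop_words}    # build the lookup set once
    punct = set(string.punctuation) | {"¿", "¡"}  # ASCII punctuation plus Spanish marks
    kept = []
    for token in tokens:
        if token.lower() in stop_set:
            continue  # stopword
        if all(ch in punct for ch in token):
            continue  # token consists only of punctuation
        kept.append(token)
    return kept

# Hand-written tokens, in the shape a tokenizer would typically produce
tokens = ["Hay", "que", "volver", ",", "antes", "del", "desayuno", "."]
print(drop_stopwords_and_punctuation(tokens, ["que", "del", "antes"]))
```

Building the set once also avoids calling `stopwords.words(language)` again for every token, which matters on longer texts.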

Let me know if this helps ;)

Answered 2025-04-16 by Python大师


Your problem is that iterating over a string yields each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

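You need to check the text word by word. Fortunately, Python strings already have a built-in method called split that can help. However, since you are dealing with natural language, which also includes punctuation, you may want to take a look here, where the answers are more thorough and use the re module.

Once you have a list of words, remember to convert them all to lowercase before comparing, then compare them the way you showed earlier.

Good luck.

Edit 1

OK, try this code; it should work for you. It shows two ways of doing it. They are essentially the same, but the first is a bit clearer, while the second is more pythonic.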
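The steps described above (split into words, strip punctuation, lowercase, then compare) could be sketched with the standard library alone. The `SAMPLE_STOPWORDS` set and the `remove_stopwords_simple` name are invented for this illustration; in practice the set would be built from `stopwords.words('spanish')`.

```python
import string

# Hypothetical stand-in for set(stopwords.words('spanish')): a tiny hand-picked subset
SAMPLE_STOPWORDS = {"el", "la", "los", "las", "de", "del", "que", "y", "es", "se", "a"}

# ASCII punctuation plus the Spanish opening marks, which string.punctuation lacks
PUNCTUATION = string.punctuation + "¿¡"

def remove_stopwords_simple(text, stop_words=SAMPLE_STOPWORDS):
    """Split on whitespace, strip surrounding punctuation, lowercase, then filter."""
    result = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION).lower()
        # Skip tokens that were only punctuation, and skip stopwords
        if word and word not in stop_words:
            result.append(word)
    return result

print(remove_stopwords_simple("El problema del matrimonio es que se acaba, y hay que volver."))
```

Note that `strip` only removes punctuation at the edges of each word, so punctuation inside a word is left alone; a regular expression gives finer control if that matters.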

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words.
# In Python 3, \w matches Unicode word characters by default, so no flags are needed
# (re.LOCALE cannot be used with str patterns there).
words = re.findall(r'\w+', sentence)

# Load the stopword list once, as a set, for fast membership tests
spanish_stopwords = set(stopwords.words('spanish'))

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in spanish_stopwords:
        important_words.append(word)

print(important_words)

# This is the more pythonic way (filter returns an iterator in Python 3, hence list())
important_words = list(filter(lambda x: x not in spanish_stopwords, words))

print(important_words)

Hope this helps.

Answered 2025-04-16 by Python大师
