Remove a list of words from a string

51 votes

7 answers

149059 views

Asked 2025-04-18 17:33

I have a list of stopwords and a search string. I want to remove the stopwords from the string.

For example:

stopwords=['what','who','is','a','at','is','he']
query='What is hello'

The code should now remove 'What' and 'is'. In my case, however, it removes 'a' and 'at' instead. My code is below. What could be going wrong?

for word in stopwords:
    if word in query:
        print(word)
        query=query.replace(word,"")

If the input query is "What is Hello", the output I get is:
wht s llo

Why does this happen?

7 Answers

" ".join([x for x in query.split() if x not in stopwords])

Answered 2025-04-18 by Python大师

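Reading the other answers to your question, I noticed that they tell you how to do what you are trying to do, but they don't answer the question you asked at the end.

If the input query is "What is Hello", the output I get is:

wht s llo

Why does this happen?

This happens because .replace() replaces every occurrence of the exact substring you give it, including occurrences inside longer words.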

For example:

"My, my! Hello my friendly mystery".replace("my", "")

This gives:

>>> "My, ! Hello  friendly stery"

.replace() essentially splits the string on the substring given as the first argument and joins the pieces back together with the second argument.

"hello".replace("he", "je")

is logically similar to:

"je".join("hello".split("he"))

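If you still wanted to use .replace() to remove whole words, you might think that adding a space before and after would be enough, but that misses words at the beginning and end of the string, as well as substrings next to punctuation.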

"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"

"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"

"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"

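Padding with spaces also fails on consecutive repeats: matches cannot overlap, so the trailing space consumed by the first match is no longer available as the leading space of the second, and the search simply continues past it: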

"hello my my friend".replace(" my ", " ")
>>> "hello my friend"

For these reasons, the answer you accepted, by Robby Cornelissen, is the recommended way to do what you want.
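
If a replace-style approach is still preferred, one possible alternative (not taken from any of the answers here) is a regular expression with \b word boundaries, which matches whole words only. A minimal sketch, assuming case-insensitive matching is wanted and that collapsing the leftover whitespace is acceptable:

```python
import re

stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

# One alternation of the escaped stopwords, wrapped in \b so only whole words match.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b",
    re.IGNORECASE,
)

# Remove the matches, then collapse the runs of whitespace they leave behind.
result = " ".join(pattern.sub("", query).split())
print(result)  # hello
```

Because 'he' must be followed by a word boundary, it no longer matches inside 'hello'.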

Answered 2025-04-18 by Python大师

Building on what karthikr said, try this:

' '.join(filter(lambda x: x.lower() not in stopwords,  query.split()))

Explanation:

query.split() # splits the variable query on whitespace, i.e. "What is hello" -> ["What","is","hello"]

filter(func,iterable) # takes a function and an iterable (list/string/etc.) and
                      # keeps only the items for which the function, called on one
                      # item at a time, returns True

lambda x: x.lower() not in stopwords   # anonymous function that takes a word,
                                       # converts it to lower case, and returns True if
                                       # the word is not in the iterable stopwords


' '.join(iterable) # joins all items of the iterable (items must be strings)
                   # using the string in front of the dot, i.e. ' ', as the joiner,
                   # e.g. ["What", "is","hello"] -> "What is hello"
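
The same pipeline, broken into its steps, might look like the sketch below. It assumes Python 3, where filter returns a lazy iterator rather than a list; the intermediate names words and kept are only for illustration.

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

# Step 1: split on whitespace.
words = query.split()

# Step 2: keep the words whose lowercase form is not a stopword.
# In Python 3 this is a lazy iterator, so it is materialised with list() here.
kept = list(filter(lambda x: x.lower() not in stopwords, words))

# Step 3: join the surviving words with single spaces.
result = ' '.join(kept)
print(result)  # hello
```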

Answered 2025-04-18 by Python大师

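The accepted answer works when given a list of words separated by spaces, but in real life words are often separated by punctuation. In that case re.split is needed.

Also, testing against stopwords as a set makes the lookup faster (although with a small number of words there is a trade-off between hashing the string and the lookup itself).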
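
A rough way to compare the two containers is timeit. The sketch below only illustrates how such a measurement could be set up; the names stopwords_list and stopwords_set are made up for the example, and timings depend on the machine, so none are quoted.

```python
import timeit

stopwords_list = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
stopwords_set = set(stopwords_list)

# Both containers give the same answer to every membership question...
probes = ['what', 'hello', 'he', '']
same_answers = all((w in stopwords_list) == (w in stopwords_set) for w in probes)

# ...so only the lookup cost differs. 'hello' is a miss, which forces the list
# to be scanned to the end, while the set needs one hash lookup.
list_time = timeit.timeit(lambda: 'hello' in stopwords_list, number=100_000)
set_time = timeit.timeit(lambda: 'hello' in stopwords_set, number=100_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```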

My proposal:

import re

query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}

# raw string so that \W reaches the regex engine unchanged
resultwords  = [word for word in re.split(r"\W+",query) if word.lower() not in stopwords]
print(resultwords)

Output (as a list of words):

['hello', 'Says', '']

There is an empty string at the end because re.split annoyingly produces empty fields, which need to be filtered out. Two solutions here:

resultwords  = [word for word in re.split(r"\W+",query) if word and word.lower() not in stopwords]  # filter out empty words

or add the empty string to the set of stopwords :)

stopwords = {'what','who','is','a','at','is','he',''}

Now the code prints:

['hello', 'Says']
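
Another option, not from the answer above, is to extract the words with re.findall instead of splitting on the separators, which avoids the empty strings entirely. A minimal sketch with the same query:

```python
import re

query = 'What is hello? Says Who?'
stopwords = {'what', 'who', 'is', 'a', 'at', 'he'}

# findall returns only the runs of word characters, so no empty fields appear.
resultwords = [word for word in re.findall(r"\w+", query) if word.lower() not in stopwords]
print(resultwords)  # ['hello', 'Says']
```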

Answered 2025-04-18 by Python大师

Here is one way to do it:

query = 'What is hello'
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
querywords = query.split()

resultwords  = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)

print(result)

I noticed that you also want to remove a word when its lowercase form is in the list, so I added a call to lower() in the condition check.
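
To see what the lower() call changes, here is a small comparison sketch using the question's data; the names without_lower and with_lower are only illustrative.

```python
stopwords = ['what', 'who', 'is', 'a', 'at', 'is', 'he']
query = 'What is hello'

# Case-sensitive test: 'What' is not equal to 'what', so it survives.
without_lower = ' '.join(word for word in query.split() if word not in stopwords)

# Lowercasing each word before the test removes 'What' as well.
with_lower = ' '.join(word for word in query.split() if word.lower() not in stopwords)

print(without_lower)  # What hello
print(with_lower)     # hello
```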

Answered 2025-04-18 by Python大师
