Python - Remove all words that contain other words in the list

3 votes

7 answers

6399 views

Asked 2025-04-16 10:25

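I have a list populated with words from a dictionary. I want to remove every word that begins with another valid word, so that only the root words remain.

For example, the word "rodeo" would be removed because it begins with the valid English word "rode". "Typewriter" would also be removed because it begins with the valid English word "type". However, "snicker" stays valid even though it contains "nick", because "nick" is in the middle of the word rather than at the beginning.

I was thinking of something like this: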

 for line in wordlist:
        if line.find(...) --

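But I want this "if" statement to check the word against every word in the list and, if a match is found, remove it from the list so that only the root words are left. Do I need to create a copy of the word list to iterate over?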

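Tags: data structures, string processing, conditionals, list traversal, word filtering, root-word extraction, English words, word matching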

7 Answers

I think jkerian's answer is the best one (assuming there is only one list), and I'd like to explain why.

Here is my version of the code (as a function):

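wordlist = ["a", "apple", "arc", "arcane", "car", "carpenter", "cat", "zebra"]  # must be sorted

def root_words(wordlist):
    result = []
    if not wordlist:
        return result
    base = wordlist[0]  # current root candidate
    for word in wordlist:
        if not word.startswith(base):
            # word does not extend base, so base is a root and word becomes the new candidate
            result.append(base)
            base = word
    result.append(base)  # the last candidate is always a root
    return result

print(root_words(wordlist))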

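As long as the word list is sorted (you can sort it inside the function if you like), this gets the result in a single pass. That works because, once the list is sorted, every word that starts with a given root word comes right after that root. For example, in your list, anything that sorts between "arc" and "arcane" must itself start with "arc", so it is eliminated straight away by the root word "arc".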
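If the input is not guaranteed to be sorted, the sort can happen inside the function, as mentioned above. The following is a minimal sketch of that idea, not the original answer's code; the name `root_words_any_order` is hypothetical, and dropping duplicates with `set` is an added assumption.

```python
def root_words_any_order(words):
    """Return the words that do not start with another word in `words`."""
    # Assumption: duplicates are dropped; sorting puts each root
    # immediately before the words that extend it.
    ordered = sorted(set(words))
    if not ordered:
        return []
    result = []
    base = ordered[0]
    for word in ordered[1:]:
        if not word.startswith(base):
            result.append(base)  # base had no shorter root, so keep it
            base = word
    result.append(base)  # the final candidate is always a root
    return result


print(root_words_any_order(["zebra", "carpenter", "arcane", "cat", "arc", "car"]))
# ['arc', 'car', 'cat', 'zebra']
```

Sorting costs O(n log n), after which the scan itself is a single pass.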

Answered 2025-04-16 by Python大师


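You have two lists: the list of words you want to check and possibly remove, and a list of valid words. You can use the same list for both purposes if you like, but I'll assume you have two separate lists.

For speed, you should turn the list of valid words into a set, so that you can check very quickly whether a particular word is in it. Then, for each word, check whether any of its prefixes is a valid word. One thing to decide: "a" and "I" are both valid English words, so will you remove every valid word that starts with 'a', or will you set a minimum length for the prefix?

I'm using the file /usr/share/dict/words from my Ubuntu install. It has all sorts of odd things in it; for example, it seems to list every letter by itself as a word, so "k", "q", "z" and so on are all in there. As far as I know those aren't words, but they are probably there for some technical reason. In any case, I decided to exclude anything shorter than three letters from my valid words list.

Here is what I came up with: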

# build the valid-word set from /usr/share/dict/words
wfile = "/usr/share/dict/words"
with open(wfile) as f:
    # strip the newline before measuring, then drop words shorter than three letters
    valid = set(w for w in (line.strip() for line in f) if len(w) >= 3)

lst = ["ark", "booze", "kite", "live", "rodeo"]

def subwords(word):
    # yield every proper prefix of word, longest first
    for i in range(len(word) - 1, 0, -1):
        w = word[:i]
        yield w

newlst = []
for word in lst:
    # uncomment these for debugging to make sure it works
    # print("subwords", [w for w in subwords(word)])
    # print("valid subwords", [w for w in subwords(word) if w in valid])
    if not any(w in valid for w in subwords(word)):
        newlst.append(word)

print(newlst)

If you are a fan of one-liners, you could do away with the for loop and use a list comprehension:

newlst = [word for word in lst if not any(w in valid for w in subwords(word))]

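I think that's a bit too terse, and I like being able to put in print statements to debug.

Hmm, come to think of it, it isn't too terse if you just add another function: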

def keep(word):
    return not any(w in valid for w in subwords(word))

newlst = [word for word in lst if keep(word)]

Python can be easy to read and understand if you write functions like this and give them good names.
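To try the two-list approach without a system dictionary file, here is a minimal self-contained sketch. The names `prefixes` and `filter_roots`, the `min_len` parameter (mirroring the three-letter cutoff described above), the tiny sample word list, and the lowercasing step are all assumptions added for illustration, not part of the code above.

```python
def prefixes(word):
    """Yield every proper prefix of `word`, longest first."""
    for i in range(len(word) - 1, 0, -1):
        yield word[:i]


def filter_roots(words, valid_words, min_len=3):
    """Keep the words that have no valid word as a proper prefix."""
    # Assumption: comparison is case-insensitive, and valid words shorter
    # than `min_len` letters are ignored.
    valid = {w.lower() for w in valid_words if len(w) >= min_len}
    return [w for w in words
            if not any(p in valid for p in prefixes(w.lower()))]


# Hypothetical sample data standing in for a real dictionary file.
sample_valid = ["a", "rode", "type", "nick", "kit"]
sample_words = ["rodeo", "Typewriter", "snicker", "kite", "ark"]

print(filter_roots(sample_words, sample_valid))
# ['snicker', 'ark']
```

"snicker" survives because "nick" is not a prefix of it, and "ark" survives because "a" falls below the minimum length.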

Answered 2025-04-16 by Python大师


I'm assuming you only have one list, and that you want to remove from it every item that starts with another item in the same list.

# Important assumption here... wordlist is sorted

base = wordlist[0]                    # consider the first word in the list
for word in wordlist:                 # loop through the entire list checking if
    if not word.startswith(base):     #  the word we're considering starts with the base
        print(base)                   # If not... we have a new base, print the current
        base = word                   #  one and move to this new one
    # else word starts with base:
    #  don't output word, and go on to the next item in the list
print(base)                           # finish by printing the last base

Edit: I added some comments to make the logic clearer.
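One way to gain confidence in the sorted single-pass idea is to compare it against a slow but obvious check. The sketch below is an illustration under assumptions: both function names are hypothetical, duplicates are removed first, and the list is assumed to contain no empty string.

```python
def roots_single_pass(wordlist):
    """Sorted scan: skip any word that starts with the most recent root."""
    roots = []
    for word in sorted(set(wordlist)):
        if roots and word.startswith(roots[-1]):
            continue  # word extends the current root, so drop it
        roots.append(word)
    return roots


def roots_brute_force(wordlist):
    """Quadratic reference: keep words that no other word is a prefix of."""
    unique = set(wordlist)
    return sorted(
        w for w in unique
        if not any(w != other and w.startswith(other) for other in unique)
    )


words = ["snicker", "nick", "rode", "rodeo", "type", "typewriter",
         "a", "arc", "arcane"]
print(roots_single_pass(words))
# ['a', 'nick', 'rode', 'snicker', 'type']
print(roots_single_pass(words) == roots_brute_force(words))
# True
```

The brute-force version compares every pair of words, so it is only useful as a reference on small inputs.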

Answered 2025-04-16 by Python大师

