Removing duplicate words from text file input?

2 votes

4 answers

2984 views

Asked 2025-04-18 05:48

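I'm playing around with a function that takes three arguments: the name of a text file, substring1, and substring2. It searches the text file and returns the words that contain both substrings.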

def myfunction(filename, substring1, substring2):
    result = ""
    text = open(filename).read().split()
    for word in text:
        if substring1 in word and substring2 in word:
            result += word + " "
    return result

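The function works, but I want to get rid of duplicate results. For example, with my particular text file, if substring1 is "at" and substring2 is "wh", it returns "what"; but because the file contains "what" three times, it returns all three. I'm looking for a way to return only unique words, and I also want to preserve the order of the results, so a set won't work.

I was thinking there might be something I could do to text to remove the duplicates before the loop.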


4 Answers

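I think that, if you want to preserve order, the best approach is to make result a list and check whether each word is already in it before appending. You should also use the with context manager to handle the file, which ensures it is closed properly once you are done with it: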

def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        text = f.read().split()
    for word in text:
        if substring1 in word and substring2 in word and word not in result:
            result.append(word)
    return " ".join(result)
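As a small illustration of the point about with, the sketch below checks that the file object is closed as soon as the block ends. The temporary file and its contents are made up for the demonstration and are not part of the original question.

```python
import os
import tempfile

# Create a throwaway sample file (hypothetical contents, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("what is that what")
    path = tmp.name

with open(path) as f:
    words = f.read().split()
    closed_inside = f.closed   # still open while inside the block

closed_after = f.closed        # closed automatically when the block exits

os.remove(path)                # clean up the sample file
print(words, closed_inside, closed_after)
```

The same guarantee holds if an exception is raised inside the block, which is the main reason to prefer with over a bare open().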

Answered 2025-04-18 by Python大师


Use a with statement to manage the file so that it is opened and closed properly. A list that you check for membership before appending does what you need:

def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                # Append only the first occurrence to keep the original order
                if word not in result:
                    result.append(word)
    return result

Also, consider returning a list rather than a string, since you can convert a list to a string at any time with very little effort:

r = myfunction(arg1, arg2, arg3)
print(",".join(r))

Edit:

@EOL makes a good point, so here is a more efficient approach (at the cost of slightly more memory):

from collections import OrderedDict

def myfunction(filename, substring1, substring2):
    result = OrderedDict()
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                result[word] = None  # the stored value is irrelevant; only the key matters
    return list(result)  # the keys, in first-insertion order

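OrderedDict is a dictionary that preserves insertion order. A dictionary's keys behave like a set: each key occurs only once. So if a key is already in the dictionary, assigning to it again does not add a second entry, and the key keeps its original position. This is much faster than searching a list for a value: a key lookup takes constant time on average, whereas a list search is linear in the length of the list.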
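The behaviour of the keys can be seen in a minimal sketch like the one below. The word list is invented for illustration. The last line relies on plain dict objects also preserving insertion order, which the language guarantees from Python 3.7 onward.

```python
from collections import OrderedDict

# Hypothetical sample input with repeated words.
words = ["what", "whatever", "what", "wheat", "whatever"]

ordered = OrderedDict()
for word in words:
    # Assigning to an existing key does not move it or add a second entry.
    ordered[word] = None

unique_words = list(ordered)  # iterating a dictionary yields its keys

# On Python 3.7+, a plain dict gives the same result in one step.
same_result = list(dict.fromkeys(words))

print(unique_words)
```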

Answered 2025-04-18 by Python大师


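In fact, all you need to do is make result a list instead of a string. Then, before appending each word, check if word not in result:. Finally, ' '.join(result) turns the list into a space-separated string.

This preserves the order in which the words appear, which a set would not.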
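A minimal sketch of this approach, working on an in-memory string rather than a file so that it runs on its own. The helper name unique_matches and the sample text are assumptions made for the example. It also shows why the separator passed to join matters.

```python
def unique_matches(text, substring1, substring2):
    """Return the words containing both substrings, first occurrences only."""
    result = []
    for word in text.split():
        if substring1 in word and substring2 in word and word not in result:
            result.append(word)
    return result

sample = "what is that whatever what"   # hypothetical sample text
matches = unique_matches(sample, "at", "wh")

spaced = " ".join(matches)   # separator is a single space
fused = "".join(matches)     # empty separator runs the words together

print(spaced)
print(fused)
```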

Answered 2025-04-18 by Python大师


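Here is a solution that uses little memory (it iterates over the file line by line instead of reading it all at once) and has good time efficiency (which matters when the list of returned words is large, for example when substring1 is "a" and substring2 is "e" in English text):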

import collections

def find_words(file_path, substring1, substring2):
    """Return a string with the words from the given file that contain both substrings."""
    matching_words = collections.OrderedDict()
    with open(file_path) as text_file:
        for line in text_file:
            for word in line.split():
                if substring1 in word and substring2 in word:
                    matching_words[word] = True
    return " ".join(matching_words)

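OrderedDict remembers the order in which keys were first inserted, so it preserves the order in which the words appear. Because it is a mapping, it cannot contain duplicate words. The good time efficiency comes from inserting a key into an OrderedDict in constant time (on average), whereas in many of the other solutions a test such as if word in result_list takes linear time.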
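A different technique with the same time behaviour, not the one used in this answer, is to keep a set purely for the membership test and yield each word the first time it is seen. The sketch below assumes a generator named iter_unique_matches and a few made-up lines standing in for the file; any iterable of lines, including an open file object, could be passed instead.

```python
def iter_unique_matches(lines, substring1, substring2):
    """Yield each matching word once, in order of first appearance."""
    seen = set()  # used only for fast membership tests
    for line in lines:
        for word in line.split():
            if substring1 in word and substring2 in word and word not in seen:
                seen.add(word)
                yield word

# Hypothetical lines standing in for the contents of a text file.
lines = ["what is that", "whatever what", "somewhat wheat"]
found = list(iter_unique_matches(lines, "at", "wh"))

print(" ".join(found))
```

Because the words are yielded one at a time, the caller decides whether to join them into a string, collect them into a list, or stop early.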

Answered 2025-04-18 by Python大师
