在大列表中查找/搜索的最有效方法（python）

36 投票

3 回答

42141 浏览

提问于 2025-04-15 21:57

-- 我刚处理了一个大文件，创建了一个包含42,000个字符串或单词的列表。我想查询这个列表，检查一个给定的单词或字符串是否在里面。所以我的问题是：

有什么高效的方法可以进行这样的查找吗？

一种初步的方法是先对列表进行排序（list.sort()），然后直接使用

>> if word in list: print 'word'

这其实很简单，但我相信还有更好的方法。我的目标是找到一种快速查找的方法，能够判断一个给定的字符串是否在这个列表中。如果你有其他数据结构的想法，我都很欢迎。不过，我现在想避免使用像Trie这样的复杂数据结构。我想听听关于快速查找的想法（或者技巧），或者任何其他可能比简单的in更快的Python库方法。

另外，我还想知道搜索项的索引。

性能优化数据结构排序算法搜索算法索引查找字符串搜索高效查找列表查询

3 个回答

根据这个程序的运行结果，字典（dict）是最快的，其次是集合（set），然后是列表（list），而且列表在使用双重包含（bi_contains）时速度排在第三位：

from datetime import datetime

def ReadWordList():
    """ Loop through each line in english.txt and add it to the list in uppercase.

    Returns:
    Returns array with all the words in english.txt.

    """
    l_words = []
    with open(r'c:\english.txt', 'r') as f_in:
        for line in f_in:
            line = line.strip().upper()
            l_words.append(line)

    return l_words

# Loop through each line in english.txt and add it to the l_words list in uppercase.
l_words = ReadWordList()
l_words = {key: None for key in l_words}
#l_words = set(l_words)
#l_words = tuple(l_words)

t1 = datetime.now()

for i in range(10000):
    #w = 'ZEBRA' in l_words
    w = bi_contains(l_words, 'ZEBRA')

t2 = datetime.now()
print('After: ' + str(t2 - t1))

# list = 41.025293 seconds
# dict = 0.001488 seconds
# set = 0.001499 seconds
# tuple = 38.975805 seconds
# list with bi_contains = 0.014000 seconds

回答于 2025-04-15 由 Python大师

分享举报

关于集合和列表有一个点没有被提到：在“解析一个大文件”时，我们通常需要处理重复的单词或字符串。你没有提到这一点。

显然，把新单词添加到集合中时，会自动去掉重复的单词，这样做不会增加你的计算时间或思考时间。如果你用列表来做这个，效率就会变得很低，复杂度是O(N**2)。如果你先把所有东西都加到列表里，然后再去掉重复的，最聪明的方法就是…… drum roll ……使用集合。而且，列表在内存上稍微占优势，但重复的单词会让这个优势变得微不足道。

回答于 2025-04-15 由 Python大师

分享举报

不要创建一个 list（列表），而是创建一个 set（集合）。因为集合查找的速度非常快，几乎是瞬间完成。

如果你不想使用集合带来的额外内存开销，那就保持一个有序的列表，然后可以用 bisect 模块来搜索这个列表。

from bisect import bisect_left
def bi_contains(lst, item):
    """ efficient `item in lst` for sorted lists """
    # if item is larger than the last its not in the list, but the bisect would 
    # find `len(lst)` as the index to insert, so check that first. Else, if the 
    # item is in the list then it has to be at index bisect_left(lst, item)
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)

回答于 2025-04-15 由 Python大师

分享举报

在大列表中查找/搜索的最有效方法（python）

3 个回答

撰写回答