使用Python在大文本文件中快速搜索字符串的方法

1 投票

6 回答

7714 浏览

提问于 2025-04-17 16:49

这是我目前的情况：

我有一个2.5MB的文本文件，里面大约有25万个字符串，按字母顺序排列
每个字符串都是独一无二的
我不需要修改文本文件里的内容：一旦加载了文本文件，就不会再编辑了
文本文件在开始时加载，然后我只需要在里面搜索字符串

最后一点就是问题所在。实际上，我需要搜索字符串的完全匹配和部分匹配。我写的算法使用了正则表达式，并尝试让过程更快：比如，我在脚本中硬编码了字母表中每个字母的索引，然后把这个大文本文件分成了26个小字典。结果完全没用，脚本还是非常慢。在这里浏览了一些帖子后，我决定尝试mmap：但我觉得这对找到所有部分匹配没有什么帮助，尤其是用正则表达式的时候。最后我得出结论，可能使用trie（字典树）可以解决我的问题，尽管我对这个概念了解不多。我应该使用trie吗？如果是的话，我该如何在Python中创建一个trie？marisa-trie模块好吗？感谢大家

补充说明：我所说的“部分匹配”，是指我有一个字符串的前缀。我不需要在字符串的末尾或中间找到匹配，只需要在开头找到。

正则表达式文本处理算法优化部分匹配字符串搜索 mmap 字典树 trie

6 个回答

使用一种叫做字典树的数据结构。

#dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []

            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']

使用示例：

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>>

回答于 2025-04-17 由 Python大师

分享举报

因为这个文件已经排好序并且被读取进来了，所以你可以直接在里面用二分查找，而不需要使用复杂的数据结构。Python里有一个内置的二分查找函数，叫做 bisect.bisect_left。

回答于 2025-04-17 由 Python大师

分享举报

最简单和最快的解决方案：

#!/usr/bin/env python

d = {}

# open your file here, i'm using /etc/hosts as an example...
f = open("/etc/hosts","r")
for line in f:
    line = line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        d[line[:i]] = True
f.close()


while True:
    w = raw_input('> ')
    if not w:
        break

    if w in d:
        print "match found", w

这里有一个稍微复杂一点，但更节省内存的方案：

#!/usr/bin/env python

d = []

def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        midval = a[mid]
        if midval < x:
            lo = mid+1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1


f = open("/etc/hosts","r")
for line in f:
    line=line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        x = hash(line[:i])
        d.append(x)
f.close()

d.sort()

while True:
    w = raw_input('> ')
    if not w:
        break

    if binary_search(d, hash(w)) != -1:
        print "match found", w

回答于 2025-04-17 由 Python大师

分享举报

使用Python在大文本文件中快速搜索字符串的方法

6 个回答

撰写回答