在Python中高效查找短字符串是否在长字符串中

6 投票

3 回答

3367 浏览

提问于 2025-04-18 09:18

我有一个包含短字符串的文件，这些短字符串我已经加载到一个列表 short 中（总共有150万个短字符串，每个长度为150）。我想找出这些短字符串中有多少个出现在一个更长的字符串中，这个长字符串的长度大约是500万，代码中称它为 seq。我用了一种很简单的方法来实现这个功能。但是，这个方法似乎运行得很慢（大约需要一天的时间）。

count1=count2=0
for line in short:
    count1+=1
    if line in seq:
            count2+=1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'

有没有什么更有效的方法可以做到这一点呢？

性能优化数据结构文本处理高效算法大数据分析字符串查找

3 个回答

好的，我知道你已经接受了另一个效果不错的答案，但为了完整性，这里有一个根据RedX在评论中建议的详细版本（我想是这样的）。

import itertools
PREFIXLEN = 50  #This will need to be adjusted for efficiency, consider doing a sensitivity study

commonpres = itertools.groupby(sorted(short), lambda x: x[0:PREFIXLEN])
survivors = []
precount = 0

for pres in commonpres:
     precount += 1
     if pres[0] in seq:
         survivors.extend(pres[1])

postcount = len(survivors)

actcount = 0
for survivor in survivors:
    if survivor in seq:
        actcount += 1

print "{} of {} strings are in long string.".format(actcount, len(short))
print "{} short strings ruled out by groups".format(len(short) - len(survivors))
print "{} total comparisons done".format(len(survivors) + precount)

这里的想法是，在检查所有符合条件的survivors之前，尽量排除掉尽可能多的常见前缀。举个极端的例子，假设你的150万个短字符串可以分成10个常见的前缀。为了简单起见，我们假设每个前缀都有15万个字符串。如果我们能通过10次检查排除掉其中两个前缀，那么后面就能节省30万次检查。这就是为什么PREFIXLEN需要“调优”的原因。如果设置得太低，你会有太多的常见前缀，结果是你根本无法节省检查次数（前缀长度为1时 = 150万次检查）。而如果PREFIXLEN设置得太高，排除前缀的效果就会很小，因为能排除的数量不多。我随便选了50这个数字，可能对你有帮助，也可能没有。

正如我之前所说，这个答案基本上是学术性的，所以如果有人看到有什么可以改进的地方，请评论或者直接编辑。

回答于 2025-04-18 由 Python大师

分享举报

进行性能分析，并尝试不同的选项。你必须遍历你的“测试”字符串序列，所以 for line in short 这段代码你很可能会一直保留。测试 if line in seq 在 CPython 中实现得相当高效，但我觉得这并不适合在一个非常大的数据集中寻找一个小的目标。你的需求有点极端，我猜正是这个测试花费了相当长的时间，成为了你代码的瓶颈。你可以尝试使用 regex 模块来比较一下在大数据集中寻找小目标的效果。

编辑：

这是一个简单的基准测试（没有重复测试，没有研究扩展行为，也没有使用性能分析模块），用来比较这个讨论中提到的方法：

import string
import random
import time


def genstring(N):
    return ''.join(random.choice(string.ascii_uppercase) for _ in xrange(N))


t0 = time.time()
length_longstring = 10**6
length_shortstring = 7
nbr_shortstrings = 3*10**6
shortstrings = [genstring(length_shortstring) for _ in xrange(nbr_shortstrings)]
longstring = genstring(length_longstring)
duration = time.time() - t0
print "Setup duration: %.1f s" % duration


def method_1():
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in longstring:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


#t0 = time.time()
#method_1()
#duration = time.time() - t0
#print "M1 duration: %.1f s" % duration


def method_2():
    shortset = set()
    for i in xrange(len(longstring)-length_shortstring+1):
        shortset.add(longstring[i:i+length_shortstring])
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in shortset:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


t0 = time.time()
method_2()
duration = time.time() - t0
print "M2 duration: %.1f s" % duration


def method_3():
    shortset = set(
        longstring[i:i+length_shortstring] for i in xrange(
            len(longstring)-length_shortstring+1))
    count1 = len(shortstrings)
    count2 = sum(1 for ss in shortstrings if ss in shortset)
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'


t0 = time.time()
method_3()
duration = time.time() - t0
print "M3 duration: %.1f s" % duration

输出结果：

$ python test.py 
Setup duration: 23.3 s
364 of 3000000 strings are in long string.
M2 duration: 1.4 s
364 of 3000000 strings are in long string.
M3 duration: 1.2 s

（这是在 Linux 上运行的 Python 2.7.3，使用的是 E5-2650 0 @ 2.00GHz 的处理器）

nneonneo 提出的方案和 chepner 建议的改进之间有些许差别。在这种情况下，运行原始代码已经不太有趣。在稍微不那么极端的条件下，我们可以对这三种方法进行比较：

length_longstring = 10**6
length_shortstring = 5
nbr_shortstrings = 10**5

$ python test1.py 
Setup duration: 1.4 s
8121 of 100000 strings are in long string.
M1 duration: 95.0 s
8121 of 100000 strings are in long string.
M2 duration: 0.4 s
8121 of 100000 strings are in long string.
M3 duration: 0.4 s

回答于 2025-04-18 由 Python大师

分享举报

如果这些short字符串的长度是固定的（你提到它们的长度是150），你可以先处理一下长字符串，把所有的短字符串提取出来，然后只需要进行集合查找（这种查找在平均情况下是很快的）：

shortlen = 150
shortset = set()
for i in xrange(len(seq)-shortlen+1):
    shortset.add(seq[i:i+shortlen])

for line in short:
    count1 += 1
    if line in shortset:
        count2 += 1

这个过程的运行时间主要会被预处理的步骤占据（因为它要插入将近500万个长度为150的字符串），但这仍然会比在一个500万字符的字符串中进行150万次搜索要快。

回答于 2025-04-18 由 Python大师

分享举报

在Python中高效查找短字符串是否在长字符串中

3 个回答

撰写回答