Explanation for the slow performance of regex string mangling in Python?

import re

content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
newstir = ""
flag = False
for l in content.split('\n'):
    if re.search(">>>FOOBAR<<<", l):
        flag = True
    # End if we encountered our flag line
    if flag:
        newstir += l
# End loop through content
content = newstir

1 answer

User

#1 · Posted 2024-05-29 06:05:30

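There is no good way to do this with a pattern that starts with .* or .*?, especially on large data. The first form causes a lot of backtracking: .* consumes as much as it can, then gives characters back one at a time until the rest of the pattern matches. The second form has to test, for every character it consumes, whether the following subpattern matches, and keeps going until it does. Using a non-greedy quantifier is not faster than using a greedy one.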
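To make the two pattern shapes concrete, here is a minimal sketch, assuming the goal is to capture everything up to and including the marker. The sample `content` string and the helper names `prefix_with_regex` and `prefix_with_find` are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample data: filler lines, one marker line, then more filler.
content = "filler line\n" * 200 + "head >>>FOOBAR<<< tail\n" + "more filler\n" * 200

MARKER = ">>>FOOBAR<<<"

# Greedy: .* first runs to the end of the string, then backtracks to the marker.
greedy = re.compile(r".*" + re.escape(MARKER), re.DOTALL)
# Lazy: .*? grows one character at a time, retrying the marker at each step.
lazy = re.compile(r".*?" + re.escape(MARKER), re.DOTALL)


def prefix_with_regex(pattern, text):
    """Return the text up to and including the marker, or None if absent."""
    match = pattern.match(text)
    return match.group(0) if match else None


def prefix_with_find(text, marker=MARKER):
    """Same result via a plain substring search, with no regex engine involved."""
    index = text.find(marker)
    return text[: index + len(marker)] if index != -1 else None
```

With a single marker in the data, all three approaches return the same prefix (with several markers, the greedy pattern would stop at the last one instead of the first). Wrapping each call in `timeit.timeit` is one way to compare them on your own data.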

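I suspect your ~600k of content starts out in a file. Instead of loading the whole file and storing its content in a variable, work line by line. That saves memory and avoids splitting the string and building a list of lines. Second, if you are looking for a literal string, don't use a regex method; use a simple string method like find, which is faster: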

result = ''
with open('yourfile') as fh:
    for line in fh:
        result += line
        # Stop once the line containing the marker has been added
        if line.find('>>>FOOBAR<<<') > -1:
            break
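A self-contained variant of this loop might look like the sketch below. The function name `read_until_marker` and the use of `io.StringIO` as a stand-in for a real file are assumptions for illustration. It also collects lines in a list and joins them once, which is a common alternative to repeated `+=`.

```python
import io


def read_until_marker(lines, marker):
    """Collect lines up to and including the first one containing marker."""
    kept = []
    for line in lines:
        kept.append(line)
        if marker in line:  # plain substring test, equivalent to find() > -1
            break
    return "".join(kept)


# io.StringIO stands in for an open file; both yield lines when iterated.
fake_file = io.StringIO("first\nsecond >>>FOOBAR<<< here\nthird\n")
result = read_until_marker(fake_file, ">>>FOOBAR<<<")
```

Because the function accepts any iterable of lines, the same call works with a real file handle: `read_until_marker(fh, '>>>FOOBAR<<<')`.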

If >>>FOOBAR<<< is not a simple literal string but a regex pattern, compile the pattern beforehand:

import re

pat = re.compile(r'>>>[A-Z]+<<<')

result = ''
with open('yourfile') as fh:
    for line in fh:
        result += line
        if pat.search(line):
            break
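If the data is already in memory, as in the question, a similar result can be had without splitting into lines at all. The sketch below is one way to do that, assuming the aim is to keep everything through the end of the line that contains the first match. `truncate_after_match` is a hypothetical helper name.

```python
import re

pat = re.compile(r">>>[A-Z]+<<<")


def truncate_after_match(text, pattern):
    """Keep text through the end of the line holding the first match."""
    match = pattern.search(text)
    if match is None:
        return text  # no marker: keep everything, like the loop without a break
    newline = text.find("\n", match.end())
    # If the matching line is the last one and has no trailing newline, keep all.
    return text if newline == -1 else text[: newline + 1]
```

Here the pattern is searched once over the whole string, and because it does not start with .* or .*?, the engine only scans forward for the first `>`.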
