为什么python regex这么慢？

3条回答

网友

1楼 · 编辑于 2024-04-23 08:24:02

我认为这与灾难性的回溯（或者至少我自己的理解）无关。

这个问题是由(.*)sol(.*)中的第一个(.*)引起的，并且regex没有锚定在任何地方。

re.findall()在索引0失败后，将在索引1、2等处重试，直到字符串结束。

badbadbadbad...bad
^                   Attempt to match (.*)sol(.*) from index 0. Fail
 ^                  Attempt to match (.*)sol(.*) from index 1. Fail
  ^                 Attempt to match (.*)sol(.*) from index 2. Fail (and so on)

它有效地具有二次复杂度O（n²），其中n是字符串的长度。

这个问题可以通过锚定你的模式来解决，因此它会在你的模式没有机会匹配的位置立即失败。(.*)sol(.*)将在一行文本（由行结束符分隔）中搜索sol，因此，如果在行的开头找不到匹配项，则在该行的其余部分找不到匹配项。

因此，您可以使用：

^(.*)sol(.*)

使用^{}选项。

运行此测试代码（从您的代码中修改）：

import datetime
import re

pattern = re.compile('^(.*)sol(.*)', re.MULTILINE)

lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
       "ciao mandi "*10000 + "sal " + "ciao mandi "*10000]
for s in lst:
    print "string len", len(s)
    start = datetime.datetime.now()
    re.findall(pattern,s)
    print "time spent", datetime.datetime.now() - start
    print

（注意，传递和失败都是220004个字符）

给出以下结果：

string len 220004
time spent 0:00:00.002000

string len 220004
time spent 0:00:00.005000

这清楚地表明，这两个案件现在有相同的数量级。

网友

2楼 · 编辑于 2024-04-23 08:24:02

汤普森NFA方法将正则表达式从默认贪婪更改为默认非贪婪。普通正则表达式引擎也可以这样做；只需将.*更改为.*?。当非贪婪的时候，你不应该使用贪婪的表达式。

有人已经为Python构建了NFA正则表达式解析器：https://github.com/xysun/regex

对于病理情况，它确实优于默认的Python正则表达式解析器。但是，它在下执行其他所有操作：

This regex engine underperforms Python's re module on normal inputs (using Glenn Fowler's test suite -- see below)

以典型病例为代价修复病理病例可能是一个不使用NFA方法作为默认引擎的好理由，而不是在病理病例可以简单避免的情况下。

另一个原因是某些特性（比如反向引用）要么非常难实现，要么不可能使用NFA方法实现。另请参见response on the Python Ideas mailing list。

因此，如果您至少创建了一个非贪婪的模式以避免灾难性的回溯，那么您的测试可以执行得更好：

pattern = re.compile('(.*?)sol(.*)')

或者根本不使用regex；您可以使用str.partition()来获取前缀和后缀：

before, sol, after = s.partition('sol')

不是所有的文本问题都是正则表达式形状的，所以放下锤子，看看工具箱的其他部分！

此外，您还可以查看替代的re模块^{}。该模块实现了一些病理病例的基本检查，并巧妙地避免了这些检查，而无需借助汤普森NFA实现。引用an entry to a Python bug report tracking ^{}：

The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes.

这台引擎可以使你的病理病例运行速度快10万倍以上：

>>> import re, regex
>>> import timeit
>>> p_re = re.compile('(.*)sol(.*)')
>>> p_regex = regex.compile('(.*)sol(.*)')
>>> s = "ciao mandi "*1000 + "sal " + "ciao mandi "*1000
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_re as p', number=1)
2.4578459990007104
>>> timeit.timeit("p.findall(s)", 'from __main__ import s, p_regex as p', number=100000)
1.955532732012216

注意数字；我将re测试限制为1次运行，耗时2.46秒，而regex测试在不到2秒的时间内运行100k次。

网友
3楼 · 编辑于 2024-04-23 08:24:02

^(?=(.*?sol))\1(.*)$

你可以试试这个。这样可以减少回溯并更快地失败。在这里试试你的字符串。

http://regex101.com/r/hQ1rP0/22

相关问题更多 >

编程相关推荐

热门问题

热门文章