Regular expression "|" operator vs. separate runs for each sub-expression

import re

def CountMatchesInBigstring(bigstring, my_regexes):
    """Counts how many of the expressions in my_regexes match bigstring."""
    count = 0
    # Join all expressions into one alternation, each in its own capture group.
    combined_expr = '|'.join(['(%s)' % r for r in my_regexes])
    matches = re.search(combined_expr, bigstring)
    if matches:
        # NumMatches is a helper that is not defined in the post.
        count += NumMatches(matches)
    return count

3 Answers

User

#1 · Edited 2024-05-14 15:22:37

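I believe your first implementation will be faster:

A key principle of Python performance is "move the logic to the C level": built-in functions (written in C) are faster than pure-Python implementations. So when the looping is done inside the built-in regex module, it should be faster.
A single regex can search for multiple patterns in one pass, so it only has to run over the file contents once, whereas multiple regexes have to read the whole file multiple times.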
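Whether the single pass actually wins is easiest to settle by measuring. The following is only a rough timing sketch: the patterns, the input string, and the helper names `search_combined` and `search_separately` are hypothetical stand-ins, not taken from the question, and the numbers will depend on the real expressions and data.

```python
import re
import timeit

def search_combined(bigstring, combined):
    """One pass over bigstring with the '|'-joined expression."""
    return combined.search(bigstring) is not None

def search_separately(bigstring, compiled):
    """One pass over bigstring per expression."""
    return [p.search(bigstring) is not None for p in compiled]

# Hypothetical patterns and input, chosen only for illustration.
my_regexes = ["alpha", "beta", "gamma"]
bigstring = "x" * 100000 + "gamma"

combined = re.compile('|'.join('(%s)' % r for r in my_regexes))
compiled = [re.compile(r) for r in my_regexes]

t_combined = timeit.timeit(lambda: search_combined(bigstring, combined), number=10)
t_separate = timeit.timeit(lambda: search_separately(bigstring, compiled), number=10)
print("combined: %.4fs  separate: %.4fs" % (t_combined, t_separate))
```

Run it against your own expressions and a representative input before drawing conclusions; simple literal patterns like the ones above may behave quite differently from complex ones.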

User

#2 · Edited 2024-05-14 15:22:37

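The two approaches will give slightly different results unless it is guaranteed that a match matches one and only one regex. Otherwise, if something matches two of them, it will be counted twice.

In theory your solution should be faster (if the expressions are mutually exclusive), because the regex compiler should be able to build a more efficient search state machine, so only one pass is needed. I would expect the difference to be small, though, unless the expressions are very similar.

Also, if it is a huge string (larger than 700k), there may be a gain from doing a single pass, since a factor of n fewer memory swaps (to disk or CPU cache) would be needed.

My bet is that in your tests the difference will not really be noticeable. I am interested in the actual numbers, so please do post the results.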
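A small sketch of how the counts can diverge. Since `NumMatches` is not defined in the question, `count_combined` below is one assumed interpretation (count the capture groups that took part in the single match), and both function names are hypothetical. The sketch also assumes the individual expressions contain no capture groups of their own.

```python
import re

def count_separately(bigstring, my_regexes):
    """Count the expressions that match somewhere in bigstring, one search each."""
    return sum(1 for r in my_regexes if re.search(r, bigstring))

def count_combined(bigstring, my_regexes):
    """Count the groups that participated in the single combined match."""
    combined_expr = '|'.join('(%s)' % r for r in my_regexes)
    match = re.search(combined_expr, bigstring)
    if not match:
        return 0
    return sum(1 for g in match.groups() if g is not None)

text = "foobar"
patterns = ["foo", "bar"]

separate = count_separately(text, patterns)  # both expressions match -> 2
combined = count_combined(text, patterns)    # search stops at 'foo' -> 1
print(separate, combined)
```

With `re.search`, the combined expression reports only the first match it finds, so only one alternative is ever credited per call.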

User

#3 · Edited 2024-05-14 15:22:37

To understand how the re module works, compile _sre.c in debug mode (put #define VERBOSE at line 103 and recompile Python). After that, you will see something like this:


>>> import re
>>> p = re.compile('(a)|(b)|(c)')
>>> p.search('a'); print '\n\n'; p.search('b')
|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb75f4|SEARCH
|0xb7f9ab1a|0xb7fb75f4|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb75f4|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb75f4|MARK 0
|0xb7f9ab24|0xb7fb75f4|LITERAL 97
|0xb7f9ab28|0xb7fb75f5|MARK 1
|0xb7f9ab2c|0xb7fb75f5|JUMP 20
|0xb7f9ab56|0xb7fb75f5|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb75f4|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb75f5|END




|0xb7f9ab10|(nil)|SEARCH
prefix = (nil) 0 0
charset = (nil)
|0xb7f9ab1a|0xb7fb7614|SEARCH
|0xb7f9ab1a|0xb7fb7614|ENTER
allocating sre_match_context in 0 (32)
allocate/grow stack 1064
|0xb7f9ab1c|0xb7fb7614|BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab20|0xb7fb7614|MARK 0
|0xb7f9ab24|0xb7fb7614|LITERAL 97
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab1c|0xb7fb7614|JUMP_BRANCH
allocating sre_match_context in 32 (32)
|0xb7f9ab32|0xb7fb7614|MARK 2
|0xb7f9ab36|0xb7fb7614|LITERAL 98
|0xb7f9ab3a|0xb7fb7615|MARK 3
|0xb7f9ab3e|0xb7fb7615|JUMP 11
|0xb7f9ab56|0xb7fb7615|SUCCESS
discard data from 32 (32)
looking up sre_match_context at 0
|0xb7f9ab2e|0xb7fb7614|JUMP_BRANCH
discard data from 0 (32)
|0xb7f9ab10|0xb7fb7615|END

>>>
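In the second trace above, the engine first tries LITERAL 97 ('a'), fails, and only then moves on to the branch with LITERAL 98 ('b'): the alternatives are attempted in order at the current position. Without recompiling Python, the branch that finally matched can be observed through the match object's `lastindex` attribute. This is a minimal sketch written for Python 3; `matched_branch` is a hypothetical helper name.

```python
import re

p = re.compile('(a)|(b)|(c)')

def matched_branch(pattern, text):
    """Return the 1-based number of the group that matched, or None if nothing matched."""
    m = pattern.search(text)
    return m.lastindex if m else None

for text in ('a', 'b', 'c', 'z'):
    print(text, matched_branch(p, text))
```

This only shows the outcome, not the opcodes the engine executed along the way; for that, the VERBOSE build described above is still needed.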
