Python 正则表达式解析流

36 投票

5 回答

11471 浏览

提问于 2025-04-16 09:37

有没有办法在Python中对一个流进行正则匹配呢？就像下面这样：

reg = re.compile(r'\w+')
reg.match(StringIO.StringIO('aa aaa aa'))

我不想通过获取整个字符串的值来实现这个。我想知道有没有办法在流上实时进行正则匹配。

正则表达式数据流流处理实时匹配

5 个回答

这似乎是个老问题。正如我在一个类似问题中提到的，你可能想要从我的解决方案中，子类化Matcher类，使用streamsearch-py进行缓冲区的正则匹配。可以看看kmp_example.py，里面有个模板。如果你发现经典的Knuth-Morris-Pratt匹配就能解决你的问题，那你现在就可以用这个小开源库来解决了:-)

回答于 2025-04-16 由 Python大师

分享举报

在处理文件时，如果你能使用 mmap 来映射文件，并且你处理的是字节串而不是Unicode字符串，那么你可以把这个内存映射的文件当作字节串传给 re 模块，它就能正常工作。这种方式的限制在于你的地址空间，而不是你的内存，所以一台有8GB内存的64位机器可以很顺利地映射一个32GB的文件。

如果你能这样做，那真是个不错的选择。如果做不到，就得考虑一些更麻烦的方案了。

第三方的 regex 模块（不是 re）提供了部分匹配的支持，这可以用来构建流式支持……不过这比较复杂，还有很多注意事项。比如说，像是回顾匹配和 ^ 这些功能就不能用了，零宽匹配也很难处理，我不确定它是否能和 regex 提供的其他高级功能正常配合，而这些在 re 中是没有的。不过，它似乎是目前最接近完整解决方案的选项。

如果你在调用 regex.match、regex.fullmatch、regex.search 或 regex.finditer 时传入 partial=True，那么除了报告完整匹配外，regex 还会报告那些如果数据扩展的话可能会匹配的内容：

In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

如果更多的数据可能会改变匹配结果，它会报告部分匹配，而不是完整匹配。例如，regex.search(r'[\s\S]*', anything, partial=True) 将始终返回部分匹配。

通过这种方式，你可以保持一个滑动窗口来匹配数据，当窗口到达末尾时扩展它，并丢弃从开始部分消耗掉的数据。不幸的是，任何会因为字符串开头的数据消失而感到困惑的东西都无法使用，所以像回顾匹配、^、\b 和 \B 都不适用。零宽匹配也需要小心处理。下面是一个概念验证的例子，它使用滑动窗口来处理文件或类似文件的对象：

import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()

    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2

回答于 2025-04-16 由 Python大师

分享举报

我也遇到过同样的问题。最开始我想到了实现一个叫做 LazyString 的类，它的作用就像字符串一样，但只在需要的时候从数据流中读取数据（我通过重新实现 __getitem__ 和 __iter__ 来获取和缓存字符，直到访问到的最高位置为止……）。

不过这个方法没有成功（我在使用 re.match 时遇到了“TypeError: expected string or buffer”的错误），于是我开始查看标准库中 re 模块的实现。

不幸的是，在数据流上使用正则表达式似乎是不可能的。这个模块的核心部分是用C语言实现的，而这个实现要求输入的所有数据必须一次性全部在内存中（我猜主要是出于性能考虑）。看起来没有简单的方法可以解决这个问题。

我还查看了 PYL（Python LEX/YACC），但是他们的词法分析器内部使用了 re，所以这也无法解决问题。

一个可能的解决方案是使用 ANTLR，它支持Python后端。ANTLR使用纯Python代码构建词法分析器，并且似乎能够处理输入流。对我来说，这个问题并不是特别重要（我并不期待我的输入会非常大……），所以我可能不会进一步研究这个，但也许值得一试。

回答于 2025-04-16 由 Python大师

分享举报

Python 正则表达式解析流

5 个回答

撰写回答