Python: find a regular expression in a file

13 votes

3 answers

14074 views

Asked 2025-04-16 11:47

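I have:

f = open(...)
r = re.compile(...)

I need:
To find the position (start and end) of the first match of the regular expression in a large file,
starting from current_pos=...

How can I do this?

I would like to have this function: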

def find_first_regex_in_file(f, regexp, start_pos=0):  
   f.seek(start_pos)  

   .... (searching f for regexp starting from start_pos) HOW?  

   return [match_start, match_end]

The file 'f' is expected to be large.

Tags: regular expressions, data processing, file handling, text search, matching algorithms

3 Answers

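Note: this was tested on Python 2.7. On Python 3 you may need to adjust a few things to handle strings versus bytes, but hopefully that will not be too painful.

Memory-mapped files may not suit your situation (in 32-bit mode there may not be enough contiguous virtual memory, you cannot read from pipes or other non-file sources, and so on).

Here is a solution that reads the data in 128k blocks. It works as long as the string your regular expression matches is smaller than that size. Note also that you are not limited to single-line regular expressions. It runs quickly, although I suspect it is slightly slower than using mmap. The actual speed depends on what you do with the matches, and on the size and complexity of the regular expression you are searching for.

The method makes sure that at most 2 blocks are held in memory. In some use cases you might want to enforce at least 1 match per block as a sanity check, but this method simply truncates the data to keep at most 2 blocks in memory. It also makes sure that a match running up to the end of the current block is not yielded. Instead, the last position is saved until the input is exhausted or another block has arrived, so that patterns such as "[^\n]+" or "xxx$" match correctly. You may still run into trouble if the regular expression ends with a lookahead such as xx(?!xyz) and the yz is in the next block, but in most cases you can work around that by other means.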

import re

def regex_stream(regex,stream,block_size=128*1024):
    stream_read=stream.read
    finditer=regex.finditer
    block=stream_read(block_size)
    if not block:
        return
    lastpos = 0
    for mo in finditer(block):
        if mo.end()!=len(block):
            yield mo
            lastpos = mo.end()
        else:
            break
    while True:
        new_buffer = stream_read(block_size)
        if not new_buffer:
            break
        # Keep the unconsumed tail (at most one block) and append the new data.
        # Plain concatenation works for str and bytes alike, on Python 2 and 3.
        if lastpos:
            size_to_append=len(block)-lastpos
            if size_to_append > block_size:
                block=block[-block_size:]+new_buffer
            else:
                block=block[lastpos:]+new_buffer
        else:
            size_to_append=len(block)
            if size_to_append > block_size:
                block=block[-block_size:]+new_buffer
            else:
                block=block+new_buffer
        lastpos = 0
        for mo in finditer(block):
            if mo.end()!=len(block):
                yield mo
                lastpos = mo.end()
            else:
                break
    if lastpos:
        block=block[lastpos:]
    for mo in finditer(block):
        yield mo

To test or explore, you can run this:

# NOTE: you can substitute a real file stream here for t_in but using this as a test
import io

def escape(text):
    # Show newlines as \n, like Python 2's encode('string_escape') did
    return text.encode('unicode_escape').decode('ascii')

t_in=io.StringIO(u'testing this is a 1regexxx\nanother 2regexx\nmore 3regexes')
block_size=len('testing this is a regex')
re_pattern=re.compile(r'\dregex+',re.DOTALL)
for match_obj in regex_stream(re_pattern,t_in,block_size=block_size):
    print('found regex in block of len %s/%s: "%s[[[%s]]]%s"'%(
        len(match_obj.string),
        block_size,escape(match_obj.string[:match_obj.start()]),
        match_obj.group(),
        escape(match_obj.string[match_obj.end():])))

Here is the output:

found regex in block of len 46/23: "testing this is a [[[1regexxx]]]\nanother 2regexx\nmor"
found regex in block of len 46/23: "testing this is a 1regexxx\nanother [[[2regexx]]]\nmor"
found regex in block of len 14/23: "\nmore [[[3regex]]]es"
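The lookahead caveat mentioned earlier is easy to reproduce without any file at all. The snippet below is only a small illustration: the pattern ab(?!cd) and the 3-character split point are made up for the demo, and the "block" is just a slice of the full text.

```python
import re

# Hypothetical pattern with a trailing negative lookahead.
pattern = re.compile(r'ab(?!cd)')

text = 'abcd'
chunk = text[:3]  # pretend the block boundary falls right after 'abc'

# On the full text there is no match: 'ab' is followed by 'cd'.
print(pattern.search(text))

# On the truncated block the 'd' is not visible yet, so the lookahead
# succeeds and a match is reported at (0, 2). It does not touch the end
# of the block, so a block-based search would not hold it back.
print(pattern.search(chunk).span())
```

If patterns like this matter to you, one workaround is to hold back any match that ends within a few characters of the block end, not only matches that touch it.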

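This method is very useful for quickly parsing large XML files: you can split them into small DOMs rooted at a sub-element, instead of handling callbacks and state as you would with a SAX parser. It also lets you filter XML faster. I have used it for plenty of other things as well. I am a little surprised that recipes like this are not more widely shared on the net!

One more thing: parsing unicode should work as long as the stream you pass in produces unicode strings. If you use character classes such as \w, you will need to add the re.U flag when compiling the pattern (on Python 2; in Python 3, str patterns are Unicode-aware by default). In that case block_size actually means a number of characters rather than bytes.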
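The question asks for absolute start and end positions rather than match objects, so here is a rough sketch of the same block-reading idea with offset tracking added. Treat it as an illustration under assumptions, not a drop-in implementation: the function name and the start_pos argument come from the question, while the block_size parameter, the binary-mode file and the bytes pattern are choices made for this example. It shares the limits described above: a match must be shorter than block_size, and patterns whose result depends on data beyond the current buffer (trailing lookaheads, alternations with a much longer branch) can give a different answer than a search over the whole file.

```python
import io
import re

def find_first_regex_in_file(f, regexp, start_pos=0, block_size=128 * 1024):
    """Return [match_start, match_end] as absolute offsets, or None.

    Assumes f is a seekable binary file, regexp is compiled from bytes,
    and no match is longer than block_size.
    """
    f.seek(start_pos)
    buf = b""
    buf_offset = start_pos  # absolute file offset of buf[0]
    while True:
        chunk = f.read(block_size)
        at_eof = not chunk
        buf += chunk
        for mo in regexp.finditer(buf):
            # A match touching the end of the buffer might still grow once
            # more data arrives, so hold it back unless the input is done.
            if mo.end() == len(buf) and not at_eof:
                break
            return [buf_offset + mo.start(), buf_offset + mo.end()]
        if at_eof:
            return None
        # Keep only the last block_size bytes so memory use stays bounded.
        if len(buf) > block_size:
            drop = len(buf) - block_size
            buf = buf[drop:]
            buf_offset += drop

# Usage with an in-memory stream standing in for a large file.
data = b'testing this is a 1regexxx\nanother 2regexx\nmore 3regexes'
pat = re.compile(br'\dregex+')
print(find_first_regex_in_file(io.BytesIO(data), pat, 0, block_size=23))   # [18, 26]
print(find_first_regex_in_file(io.BytesIO(data), pat, 19, block_size=23))  # [35, 42]
```

Working in bytes keeps the returned offsets usable with f.seek(). With a text-mode file the positions would count characters instead, and seeking to them is not reliable for multi-byte encodings.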

Answered 2025-04-16 by Python大师


The following code works reasonably well on a test file of around 2GB.

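def search_file(pattern, filename, offset=0):
    # Binary mode keeps offsets exact, so pattern must be compiled from bytes,
    # e.g. re.compile(br'...').
    with open(filename, 'rb') as f:
        f.seek(offset)
        line_start = offset  # absolute offset of the current line
        for line in f:
            m = pattern.search(line)
            if m:
                return line_start + m.start(), line_start + m.end()
            # f.tell() is unreliable while iterating, so count bytes instead
            line_start += len(line)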

Note that the regular expression cannot span multiple lines.

Answered 2025-04-16 by Python大师


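One way to search a large file is to use the mmap library, which maps the file into one large chunk of memory. You can then search that memory without having to read the file line by line.

For example, something like: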

import mmap
import os
import re

size = os.stat(fn).st_size
f = open(fn, 'rb')
data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

m = re.search(br"867-?5309", data)  # bytes pattern: an mmap is bytes-like

This works well for very large files (I have done it with a file over 30GB, but you need a 64-bit OS if your file is larger than a GB or two).
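To get the start position and the (start, end) result the question asks for, the mmap approach can be wrapped roughly as follows. This is a sketch under assumptions: find_first_regex_mmap is a made-up name, the pattern is expected to be compiled from bytes, and it relies on Python 3.2+ where an mmap object can be used as a context manager. Note that mmap refuses to map an empty file, so that case would need separate handling.

```python
import mmap
import re

def find_first_regex_mmap(filename, regexp, start_pos=0):
    """Return (start, end) of the first match at or after start_pos, or None.

    regexp must be a pattern compiled from bytes.
    """
    with open(filename, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # The pos argument of search() starts scanning at start_pos
            # without copying or slicing the mapped data.
            mo = regexp.search(data, start_pos)
            return (mo.start(), mo.end()) if mo else None
```

Calling it as find_first_regex_mmap(fn, re.compile(br"867-?5309"), current_pos) returns absolute file offsets directly, because the mapping starts at byte 0 of the file.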

Answered 2025-04-16 by Python大师

