Buffered parsing of a large file using a generator

Posted 2024-04-27 00:49:57


I have a large file that needs parsing. Since it is regenerated from an external query every time the script runs, I can't parse it once and cache the results. To keep the memory footprint down, I'd like to read and parse only logical "chunks" of the file: everything from an opening "product" line through the closing curly brace. I'm not sure what the canonical way to do this in Python is, but I tried the following:

import re

def read_chunk(file_name, pattern_open_line, pattern_close_line):
    # Lazily yield one logical chunk at a time: the list of stripped lines
    # from a line matching pattern_open_line through the first subsequent
    # line matching pattern_close_line.
    open_line = re.compile(pattern_open_line)
    close_line = re.compile(pattern_close_line)
    with open(file_name, "r") as in_file:
        chunk = []
        in_chunk = False
        for line in in_file:
            line = line.strip()
            if in_chunk:
                chunk.append(line)
                if close_line.match(line):
                    yield chunk
                    in_chunk = False
            if open_line.match(line):
                chunk = [line]
                in_chunk = True
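
For context, the open/close patterns used below suggest the input file looks roughly like this (a hypothetical sketch; the actual format isn't shown in the question):

product
{
    productNumber: "ABC-123";
    description: "first product";
}
product
{
    productNumber: "DEF-456";
}

Each yielded chunk is then the list of stripped lines from one "product" line through its closing brace.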

def get_products_buffered(infile):
    chunks = read_chunk(infile, r'^product\s*$', r'^\s*\}\s*')
    products = []
    for lines in chunks:
        for line in lines:
            if line.startswith('productNumber:'):
                productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                products.append(productNumber)
    return products

def get_products_unbuffered(infile):
    with open(infile) as f:
        lines = f.readlines()  # slurps the whole file into memory at once
    products = []
    for line in lines:
        if line.startswith('productNumber:'):
            productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
            products.append(productNumber)
    return products
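
For reference, the timing numbers below can be reproduced with a minimal harness along these lines (a sketch rather than the exact script used; the file name is hypothetical):

import time

def time_parser(label, fn, infile):
    # Time one parser end to end and report what it found.
    start = time.perf_counter()
    products = fn(infile)
    print(label)
    print(f"Found {len(products)} products:")
    print("Execution time:", time.perf_counter() - start)

time_parser("Buffered reading", get_products_buffered, "products.out")    # hypothetical file
time_parser("Unbuffered reading", get_products_unbuffered, "products.out")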

I profiled both runs, and the unbuffered read turned out to be faster:

Buffered reading
Found 9370 products:
Execution time: 3.0031037185720177
Unbuffered reading
Found 9370 products:
Execution time: 1.2247122452647523

It also incurs a much bigger memory hit, since the file is essentially read into memory all at once:

Line #    Mem usage    Increment   Line Contents
================================================
    29     28.2 MiB      0.0 MiB   @profile
    30                             def get_products_buffered(infile):
    31     28.2 MiB      0.0 MiB       chunks = read_chunk(infile, '^product\s*$', '^\s*\}\s*')
    32     28.2 MiB      0.0 MiB       products = []
    33     30.1 MiB      1.9 MiB       for lines in chunks:

versus:

Line #    Mem usage    Increment   Line Contents
================================================
    42     29.2 MiB      0.0 MiB   @profile
    43                             def get_products_unbuffered(infile):
    44     29.2 MiB      0.0 MiB       with open(infile) as f:
    45    214.5 MiB    185.2 MiB           lines = f.readlines()
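
The line-by-line tables above are in memory_profiler's output format; for reference, a sketch of how such a profile is produced (assumes the memory-profiler package; the file name is hypothetical):

# pip install memory-profiler
from memory_profiler import profile

@profile  # prints a line-by-line memory table when the decorated function runs
def get_products_unbuffered(infile):
    with open(infile) as f:
        lines = f.readlines()  # this is the line that shows the ~185 MiB increment
    return [line for line in lines if line.startswith('productNumber:')]

get_products_unbuffered("products.out")  # or run: python -m memory_profiler script.py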
