Most memory-efficient way to read chunks in a buffer in Python

Question

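I have a text file with many lines (several GB, about 12 million lines). Each line holds the x, y, z coordinates of a point plus some additional information. I want to read the file chunk by chunk, process each point, and split the results into several text files in a temporary folder, according to the position of each point in a square grid with 0.25 m cells.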

449319.34;6242700.23;0.38;1;1;1;0;0;42;25;3;17;482375.326087;20224;23808;23808
449310.72;6242700.22;0.35;3;1;1;0;0;42;23;3;17;482375.334291;20480;24576;24576
449313.81;6242700.66;0.39;1;1;1;0;0;42;24;3;17;482375.342666;20224;24576;24576
449298.37;6242700.27;0.39;1;1;1;0;0;42;21;3;17;482375.350762;18176;22784;23552
449287.47;6242700.06;0.39;11;1;1;0;0;42;20;3;17;482375.358921;20736;24832;24832
449290.11;6242700.21;0.35;1;1;1;0;0;42;20;3;17;482375.358962;19968;24064;23808
449280.48;6242700.08;0.33;1;1;1;0;0;42;18;3;17;482375.367142;22528;25856;26624
449286.97;6242700.44;0.36;3;1;1;0;0;42;19;3;17;482375.367246;19712;23552;23296
449293.03;6242700.78;0.37;1;1;1;0;0;42;21;3;17;482375.367342;19456;23296;23808
449313.36;6242701.92;0.38;6;1;1;0;0;42;24;3;17;482375.367654;19968;24576;24576
449277.48;6242700.17;0.34;8;1;1;0;0;42;18;3;17;482375.375420;20224;23808;25088
449289.46;6242700.85;0.31;3;1;1;0;0;42;20;3;17;482375.375611;18944;23040;23040

Here ";" is the delimiter and the first two columns are x and y, which are used to determine the ID position.
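To make the mapping from x and y to a grid ID concrete, here is a minimal sketch that uses the same formula as `point_grid_id` in the code further down. The origin values `MINX` and `MAXY` are hypothetical and are not taken from the real data set; the helper name `grid_id` is also only illustrative.

```python
def grid_id(line, minx, maxy, size=0.25, sep=";"):
    """Return the (row, col) grid cell of the point stored on `line`."""
    # Only the first two fields (x and y) are needed for the ID.
    x, y = line.split(sep, 2)[:2]
    col = int((float(x) - minx) / size)   # columns grow to the right of minx
    row = int((maxy - float(y)) / size)   # rows grow downward from maxy
    return row, col


# Hypothetical grid origin (upper-left corner), chosen only for illustration.
MINX, MAXY = 449277.0, 6242702.0

sample = "449319.34;6242700.23;0.38;1;1;1;0;0;42;25;3;17;482375.326087;20224;23808;23808"
row, col = grid_id(sample, MINX, MAXY)
print("%d;%d;%s" % (row, col, sample))
```

With a different origin the resulting IDs change, so the values printed here are not expected to match the IDs in the examples of this question.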

The output is a set of other text files in which only one point is extracted at random for each ID.

For example:

    20;10;449319.34;6242700.23;0.38;1;1;1;0;0;42;25;3;17;482375.326087;20224;23808;23808
    20;10;449310.72;6242700.22;0.35;3;1;1;0;0;42;23;3;17;482375.334291;20480;24576;24576
    20;10;449313.81;6242700.66;0.39;1;1;1;0;0;42;24;3;17;482375.342666;20224;24576;24576
    20;10;449298.37;6242700.27;0.39;1;1;1;0;0;42;21;3;17;482375.350762;18176;22784;23552
    20;11;449287.47;6242700.06;0.39;11;1;1;0;0;42;20;3;17;482375.358921;20736;24832;24832
    20;11;449290.11;6242700.21;0.35;1;1;1;0;0;42;20;3;17;482375.358962;19968;24064;23808

Here the first two columns are the ID.

The final output will be (example), without the ID values:

         20;10;449313.81;6242700.66;0.39;1;1;1;0;0;42;24;3;17;482375.342666;20224;24576;24576
         20;11;449287.47;6242700.06;0.39;11;1;1;0;0;42;20;3;17;482375.358921;20736;24832;24832
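The "one random point per ID" step could be done in a single pass with reservoir sampling of size one. This is only a sketch under the assumption that each line already starts with the two ID columns, as in the example above; the function name `pick_one_per_id` is hypothetical. It keeps one line per ID in memory, so it may still be too large if there are millions of distinct IDs.

```python
import random


def pick_one_per_id(lines, sep=";", rng=random):
    """Keep one uniformly random line for each (row, col) ID in one pass."""
    chosen = {}  # ID -> line currently selected for that ID
    seen = {}    # ID -> number of lines seen so far for that ID
    for line in lines:
        row, col, _rest = line.split(sep, 2)
        key = (row, col)
        seen[key] = seen.get(key, 0) + 1
        # The n-th line of an ID replaces the current pick with probability 1/n,
        # which leaves every line of that ID equally likely to be the final pick.
        if rng.randrange(seen[key]) == 0:
            chosen[key] = line
    return chosen


lines = [
    "20;10;449319.34;6242700.23;0.38",
    "20;10;449310.72;6242700.22;0.35",
    "20;11;449287.47;6242700.06;0.39",
]
result = pick_one_per_id(lines)
for key in sorted(result):
    print(result[key])
```

Because the sample is chosen while reading, the intermediate per-ID files would not be needed for this step, provided the dictionary fits in memory.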

I am using a solution from this blog.

# File: readline-example-3.py

file = open("sample.txt")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass # do something

My code is the following:

from __future__ import division
import os
import glob
import tempfile
import sys

def print_flulsh(n, maxvalue = None):
    sys.stdout.write("\r")
    if maxvalue is None:
        sys.stdout.write("Laser points processed: %d" % n)
    else:
        sys.stdout.write("%d of %d laser points processed" % (n, maxvalue))
    sys.stdout.flush()


def point_grid_id(x, y, minx, maxy, size):
    """give id (row,col)"""
    col = int((x - minx) / size)
    row = int((maxy - y) / size)
    return row, col


def tempfile_tile_name(line, temp_dir, minx, maxy, size, parse):
    x, y = line.split(parse)[:2]
    row, col = point_grid_id(float(x), float(y), minx, maxy, size)
    return os.path.normpath(os.path.join(temp_dir + os.sep,"tempfile_%s_%s.tmp" % (row, col)))

# Split the text file into small text files according to the ID value given by tempfile_tile_name
# where:
# filename: name and path of the text file
# temp_dir: temporary folder
# minx, maxy: origin of the grid (upper-left corner)
# size: cell size of the grid
# parse: delimiter of the text file
# num: number of lines (~12 million)

def tempfile_split(filename, temp_dir, minx, maxy, size, parse, num):
    index = 1
    with open(filename) as file:
        while True:
            lines = file.readlines(100000)
            if not lines:
                break
            for line in lines:
                print_flulsh(index, num)
                index += 1
                name = tempfile_tile_name(line, temp_dir, minx, maxy, size, parse)
                with open(name, 'a') as outfile:
                    outfile.write(line)

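The main problem with my code is that it becomes very slow once about 2 million split text files have been saved in the temporary folder. With regard to the effbot.org solution, I would like to know whether there is an optimized way to create a buffer.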
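One possible form of such a buffer, as a sketch rather than a tested fix: group the lines of each chunk by tile in a dictionary and open every tile file once per chunk instead of once per line. The function name `split_buffered` and the `hint` default are assumptions made for illustration. This does not reduce the number of files in the folder, so how much it helps depends on how many lines of one chunk fall into the same tile.

```python
import os
from collections import defaultdict


def split_buffered(filename, temp_dir, minx, maxy, size, sep=";", hint=1000000):
    """Split `filename` into per-tile files, opening each tile file once per chunk."""
    with open(filename) as src:
        while True:
            # readlines(hint) returns whole lines totalling roughly `hint` bytes.
            lines = src.readlines(hint)
            if not lines:
                break
            buffers = defaultdict(list)  # (row, col) -> lines of this chunk
            for line in lines:
                x, y = line.split(sep, 2)[:2]
                col = int((float(x) - minx) / size)
                row = int((maxy - float(y)) / size)
                buffers[(row, col)].append(line)
            # Flush the chunk: one open/append/close per tile, not per line.
            for (row, col), tile_lines in buffers.items():
                name = os.path.join(temp_dir, "tempfile_%s_%s.tmp" % (row, col))
                with open(name, "a") as out:
                    out.writelines(tile_lines)
```

A larger `hint` means more lines per chunk and therefore fewer file openings, at the cost of more memory per chunk.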


1 Answer
