Why is readline() so slow when reading from a PIPE?
I am trying to read a large gzipped csv file and process each line.
I tried two different approaches:
It turns out that the commonly recommended approach is 100x slower than the alternative. Am I doing something wrong, or is the implementation of Popen().stdout really that bad? (It seems to read the file one character at a time.)
from time import time
from subprocess import Popen, PIPE
# We generate a csv file with 1M lines of 3D coordinates
from random import random
import os
N = 1000000
PATH = 'test'
GZIP_PATH = 'test.gz'
with open(PATH, 'w') as datafile:
    for i in xrange(N):
        datafile.write('{0}, {1}, {2}\n'.format(random(), random(), random()))
try:
    os.remove(GZIP_PATH)
except OSError:
    pass
Popen(['gzip', PATH]).wait()
# We want to process the file line by line
# We start with a textbook implementation
def simple_generator(file):
    line = file.readline()
    while line:
        yield line[:-1]  # strip the trailing '\n'
        line = file.readline()
with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i += 1  # process the line
    print time() - t
    print i
# Now a lower-level implementation
BLOCK_SIZE = 1 << 16

def fast_generator(file):
    rem = ''
    block = file.read(BLOCK_SIZE)
    while block:
        lines = block.split('\n')
        lines[0] = rem + lines[0]  # join with the partial line from the previous block
        for i in xrange(0, len(lines) - 1):
            yield lines[i]
        rem = lines[-1]  # keep the partial last line for the next block
        block = file.read(BLOCK_SIZE)
with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE).stdout as datafile:
    t = time()
    i = 0
    for line in fast_generator(datafile):
        i += 1  # process the line
    print time() - t
    print i
# Output:
#
# 34.0195429325
# 1000000
# 0.232397794724
# 1000000
#
# The second implementation is 100x faster!
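One caveat with fast_generator: if the input does not end with a newline, the final partial line stays in rem and is never yielded. The file generated above always ends with '\n', so the timings are unaffected, but a safer variant (a sketch; fast_generator_safe is a hypothetical name, not part of the original code) would flush rem at the end:

def fast_generator_safe(file):
    # Same as fast_generator, but also yields a final line that
    # lacks a trailing newline.
    rem = ''
    block = file.read(BLOCK_SIZE)
    while block:
        lines = block.split('\n')
        lines[0] = rem + lines[0]
        for l in lines[:-1]:
            yield l
        rem = lines[-1]
        block = file.read(BLOCK_SIZE)
    if rem:
        yield rem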
1 Answer
The right fix is to pass bufsize=-1 when calling Popen, so the pipe's file object is fully buffered. With the default bufsize=0 the stream is unbuffered, and readline() has to fetch one byte at a time so that it does not read past the end of the line.
with Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE, bufsize=-1).stdout as datafile:
    t = time()
    i = 0
    for line in simple_generator(datafile):
        i += 1  # process the line
    print time() - t
    print i
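If you cannot control how the Popen object is created (so you cannot set bufsize yourself), another option, sketched below rather than taken from the original answer, is to wrap the pipe's file descriptor in a buffered reader from the io module, available since Python 2.6:

import io

proc = Popen(['gunzip', '-c', GZIP_PATH], stdout=PIPE)
# io.open() on a file descriptor returns a BufferedReader, whose
# readline() pulls large chunks from the OS instead of single bytes.
# closefd=False leaves ownership of the descriptor with proc.stdout.
datafile = io.open(proc.stdout.fileno(), 'rb', closefd=False)
t = time()
i = 0
for line in simple_generator(datafile):
    i += 1  # process the line
print time() - t
print i
proc.stdout.close()
proc.wait()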
That said, I am a little surprised that bufsize defaults to 0.
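For reference: this default changed in Python 3.3.1, where bufsize now defaults to -1, so Popen pipes are buffered out of the box and readline() is fast without any extra argument.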