Python text file processing speed problem

Asked 2025-04-15 21:55

I'm running into a problem processing a fairly large file in Python. All I'm doing is

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
        counter = counter + 1
        if (counter % 1000000 == 0):
                print counter
f.close()

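Just opening the file, reading the lines and incrementing the counter takes about 10m25s.

In Perl, processing the same file and doing quite a bit more (including some regular expression work), the whole run takes about 1m17s.

Perl code: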

open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
        if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
                $_ =~ s/some regex here/$1,$2,$3,$4/;
                push @an_array, $_
        }
}
close LOG;

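Can anyone advise what I can do to make the Python solution run at a speed similar to the Perl one?

EDIT: I tried uncompressing the file first and reading it with open instead of gzip.open, but that only brought the total time down to about 4m14.972s, which is still too slow.

I also removed the modulo and print statements and replaced them with pass, so the loop now does nothing except iterate over the file.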

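5 Answers

I spent a while on this. Hopefully this code will do the trick. It uses zlib and makes no external calls.

The gunzipchunks method reads the compressed gzip file in chunks, which you can iterate over (it is a generator).

The gunziplines method reads those decompressed chunks and hands you one line at a time, which you can also iterate over (another generator).

Finally, the gunziplinescounter method gives you the result you are looking for.

Good luck!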

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name):
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending,1)
            yield line
            chunk = leftovers
        # keep the unterminated tail, even when this chunk had no line ending
        leftovers = chunk
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

On very large files this should run considerably faster than the built-in gzip module.
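For anyone on Python 3, where zlib hands back bytes rather than str, the same chunk-then-split idea might look roughly like the sketch below. The names iter_gzip_lines and count_lines are made up for illustration and are not part of the code above. It assumes a single-member gzip file (concatenated members are not handled) and does not cap the size of each decompressed piece, so treat it as a starting point rather than a drop-in replacement.

```python
import zlib


def iter_gzip_lines(path, chunk_size=64 * 1024):
    """Yield lines (bytes, without the trailing newline) from a gzip file."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""  # unterminated tail carried over between chunks
    with open(path, "rb") as f:
        while True:
            packet = f.read(chunk_size)
            if not packet:
                break
            pending += inflator.decompress(packet)
            # Everything before the last newline is complete; keep the rest.
            *complete, pending = pending.split(b"\n")
            for line in complete:
                yield line
    pending += inflator.flush()
    if pending:
        *complete, last = pending.split(b"\n")
        for line in complete:
            yield line
        if last:  # final line with no trailing newline
            yield last


def count_lines(path):
    """Count lines without keeping them in memory."""
    return sum(1 for _ in iter_gzip_lines(path))
```

Calling count_lines('big.txt.gz') would then play the role of gunziplinescounter, minus the checkpoint printing.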

Answered 2025-04-15 by Python大师

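If you search the web for "why is Python gzip slow" you'll find plenty of discussion, including patches with improvements for Python 2.7 and 3.2. In the meantime, you can use zcat as you did in Perl, which is very fast. Your (first) function takes about 4.19s on a 5MB compressed file, while the second function takes only 0.78s. However, I don't know what is going on with your uncompressed files. If I uncompress the log files (Apache logs) and run the two functions on them with a plain Python open and Popen('cat'), Python is faster (0.17s) than cat (0.48s).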

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
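One caveat with test_new: communicate() pulls the entire decompressed output into memory before splitting it, which may hurt on a log that is several gigabytes uncompressed. A possible variation, sketched here in Python 3 with a hypothetical helper name (count_lines_from_command), reads the pipe line by line instead. It assumes the command writes newline-terminated text to stdout.

```python
import subprocess


def count_lines_from_command(cmd):
    """Run cmd and count the lines it writes to stdout, one line at a time."""
    counter = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        # Iterating over the pipe reads incrementally; each line is bytes.
        for line in proc.stdout:
            counter += 1
    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("command exited with status %d" % proc.returncode)
    return counter


# Hypothetical usage, assuming zcat is on the PATH:
# count_lines_from_command(["zcat", "small.log.gz"])
```

I have not timed this variant against the numbers above, so take it as a memory-usage tweak, not a speed claim.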

Answered 2025-04-15 by Python大师

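In Python (at least in versions <= 2.6.x), gzip format parsing is implemented in Python (on top of zlib). What's more, it appears to do something strange: it decompresses to the end of the file into memory and then discards everything beyond the requested read size (and then does the same again on the next read). DISCLAIMER: I only looked at gzip.read() for three minutes, so I may have misread it. Whether or not my understanding of gzip.read() is correct, the gzip module does not appear to be optimized for large data volumes. Try doing what you do in Perl, i.e. launching an external process (see the subprocess module, for example).

EDIT: Actually, I missed the OP's remark that plain file I/O was just as slow as compressed I/O (thanks to ire_and_curses for pointing it out). That struck me as unlikely, so I took some measurements...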

import gzip
from timeit import Timer

def w(n):
    L = "*"*80+"\n"
    with open("ttt", "w") as f:
        for i in xrange(n) :
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n,line in enumerate(f) :
            if n % 1000000 == 0 :
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n,line in enumerate(f) :
        if n % 1000000 == 0 :
            print n

Now, running it...

>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)

...after finishing my tea and discovering that it was still running, I killed it, sorry. Then I tried 100,000 lines instead of 10,000,000:

>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895

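The gzip module's read time is O(file_size**2), so with line counts in the millions, the gzip read time simply cannot match the plain read time (as the experiment confirms). Anonymouslemming, please check again.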
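To repeat the measurement without switching to a terminal, something along these lines might do. It is a Python 3 sketch: the helper names (write_plain, compress, count_plain, count_gzip, benchmark) are invented here, and compress stands in for running gzip by hand. Timings depend heavily on the Python version, so no expected numbers are given.

```python
import gzip
import shutil
import timeit


def write_plain(path, n):
    """Write n lines of 80 asterisks, like w() above."""
    line = "*" * 80 + "\n"
    with open(path, "w") as f:
        for _ in range(n):
            f.write(line)


def compress(src, dst):
    """Create dst as a gzip copy of src (replaces the manual terminal step)."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def count_plain(path):
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_gzip(path):
    with gzip.open(path, "rb") as f:
        return sum(1 for _ in f)


def benchmark(n, plain="ttt", gz="ttt.gz"):
    """Return (plain_seconds, gzip_seconds) for reading n lines once each."""
    write_plain(plain, n)
    compress(plain, gz)
    t_plain = timeit.timeit(lambda: count_plain(plain), number=1)
    t_gzip = timeit.timeit(lambda: count_gzip(gz), number=1)
    return t_plain, t_gzip
```

Running benchmark(100000) and then benchmark(1000000) and comparing how the second number grows is one way to check the quadratic claim on your own interpreter.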

Answered 2025-04-15 by Python大师
