Alternatives to zip() for two iterables

13 votes
4 answers
10863 views
Asked 2025-04-15 19:39

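I have two very large text files (about 100 GB each) that I need to process in parallel.

zip works fine for small files, but I found that it actually builds a list of every pair of lines from both files, which means every line ends up held in memory. I only need to process each line once and never reuse it.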

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

Is there an alternative to zip() that behaves like a generator, so that I can process both files without using more than 200 GB of memory?

4 Answers

0

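If you want to stop at the end of the shorter file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while True:
        # next() works on Python 2.6+ and Python 3; handle.next() is Python 2 only
        i = next(handle1)
        j = next(handle2)

        # do something with i and j
        # write to an output file

except StopIteration:
    # one of the files is exhausted
    pass

finally:
    handle1.close()
    handle2.close()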

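Otherwise, to keep going until both files are exhausted:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while True:
    try:
        i = next(handle1)
    except StopIteration:
        i = ''  # pad with an empty string once filea is exhausted
        i_ended = True
    try:
        j = next(handle2)
    except StopIteration:
        j = ''  # pad with an empty string once fileb is exhausted
        j_ended = True

    if i_ended and j_ended:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()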

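Or, using readline():

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while True:
    # readline() returns '' at end of file
    i = handle1.readline()
    j = handle2.readline()

    # stop before processing once both files are exhausted
    if not i and not j:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()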
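
The loops above could also be wrapped in a small generator so the calling code stays a plain for loop. The following is only a sketch for Python 3: the helper name paired_lines and the in-memory io.StringIO stand-ins for the two files are assumptions for illustration, not part of the answer. It relies on next() accepting a default value, which is returned instead of raising StopIteration.

```python
import io


def paired_lines(handle_a, handle_b):
    """Yield (line_a, line_b) pairs until both inputs are exhausted.

    An exhausted input contributes '' for the remaining pairs.
    (Hypothetical helper, not from the original answer.)
    """
    while True:
        # next() with a default returns '' instead of raising StopIteration.
        line_a = next(handle_a, '')
        line_b = next(handle_b, '')
        if not line_a and not line_b:
            return
        yield line_a, line_b


# In-memory stand-ins for the two files (made-up contents).
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\n")

pairs = list(paired_lines(file_a, file_b))
```

Only one line from each file is held at a time. A blank line inside a file is read as '\n', which is truthy, so it does not end the loop early.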
16

You can use izip_longest like this to pad the shorter file with empty lines.

Python 2.6:

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

Or in Python 3+:

from itertools import zip_longest
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
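
For a concrete end-to-end run, here is a minimal sketch for Python 3 that also writes an output file, as the question describes. The function name merge_files, the tab-joined output format, and the temporary file contents are all assumptions made for the example.

```python
import os
import tempfile
from itertools import zip_longest


def merge_files(path_a, path_b, path_out):
    """Write one tab-joined line per input pair and return the pair count.

    (Hypothetical helper; the output format is an assumption.)
    """
    count = 0
    with open(path_a) as fa, open(path_b) as fb, open(path_out, 'w') as out:
        # zip_longest pulls one line from each file per iteration and
        # substitutes '' once the shorter file runs out.
        for a, b in zip_longest(fa, fb, fillvalue=''):
            out.write(a.rstrip('\n') + '\t' + b.rstrip('\n') + '\n')
            count += 1
    return count


# Small temporary files standing in for the 100 GB inputs.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, 'filea')
path_b = os.path.join(tmpdir, 'fileb')
path_out = os.path.join(tmpdir, 'merged')

with open(path_a, 'w') as f:
    f.write("1\n2\n3\n")
with open(path_b, 'w') as f:
    f.write("x\ny\n")

pair_count = merge_files(path_a, path_b, path_out)

with open(path_out) as f:
    merged = f.read()
```

Because each pair is written out before the next one is read, memory use stays at roughly one line per file regardless of file size.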
22

In Python 2, the itertools module has a function called izip that does what you want.

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files you are processing differ in length, you can use izip_longest, because izip stops as soon as the shorter file is exhausted.
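
On Python 3 there is no izip; the built-in zip is itself lazy and stops at the shorter input in the same way. The sketch below illustrates this under stated assumptions: numbered_lines is a made-up generator standing in for a file handle, and the log list exists only to record how many items were actually pulled.

```python
def numbered_lines(prefix, n, log):
    """Stand-in for a file handle: yields n lines and records each one read."""
    for k in range(n):
        log.append(prefix + str(k))
        yield prefix + str(k) + "\n"


log = []
# Creating the zip object reads nothing yet.
pairs = zip(numbered_lines("a", 1000000, log), numbered_lines("b", 3, log))
consumed_before = len(log)

first = next(pairs)
consumed_after_first = len(log)  # one item pulled from each input

rest = list(pairs)  # stops once the 3-line input is exhausted
```

Note that zip reads one extra item from the first input before it discovers that the second one is finished, so that line is consumed without being processed.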
