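Alternative to zip() for two iterables
I have two large text files (around 100GB) that I need to process in parallel.
zip works well for small files, but I found that it actually builds a list of every line from both files, which means every line is held in memory. I only need to process each line once and never reuse it.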
    handle1 = open('filea', 'r')
    handle2 = open('fileb', 'r')
    for i, j in zip(handle1, handle2):
        # do something with i and j
        # write to an output file
        # no need to do anything with i and j after this
        pass
Is there an alternative to zip() that behaves like a generator, so that I can process these two files without using more than 200GB of memory?
4 Answers
0
If you want to stop at the end of the shorter file:
    handle1 = open('filea', 'r')
    handle2 = open('fileb', 'r')
    try:
        while True:
            i = next(handle1)  # raises StopIteration at end of file
            j = next(handle2)
            # do something with i and j
            # write to an output file
    except StopIteration:
        pass
    finally:
        handle1.close()
        handle2.close()
Otherwise, to keep going until both files are exhausted:
    handle1 = open('filea', 'r')
    handle2 = open('fileb', 'r')
    i_ended = False
    j_ended = False
    while True:
        try:
            i = next(handle1)
        except StopIteration:
            i, i_ended = '', True  # filea is exhausted; use an empty line
        try:
            j = next(handle2)
        except StopIteration:
            j, j_ended = '', True  # fileb is exhausted; use an empty line
        if i_ended and j_ended:
            break  # check before processing so stale lines are never reused
        # do something with i and j
        # write to an output file
    handle1.close()
    handle2.close()
Or:
    handle1 = open('filea', 'r')
    handle2 = open('fileb', 'r')
    while True:
        i = handle1.readline()  # returns '' at end of file
        j = handle2.readline()
        if not i and not j:
            break  # both files are finished
        # do something with i and j
        # write to an output file
    handle1.close()
    handle2.close()
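The readline loop can also be wrapped in a small generator, which keeps the pairing logic separate from the processing. This is only a minimal sketch, assuming Python 3: `paired_lines` is a hypothetical helper name (not a standard-library function), and `io.StringIO` objects stand in for the real files.

```python
import io

def paired_lines(handle1, handle2, fill=""):
    """Lazily yield (line1, line2) pairs until both handles are exhausted.

    Hypothetical helper for illustration. Only one line from each
    handle is held in memory at a time.
    """
    while True:
        i = handle1.readline()  # returns "" at end of file
        j = handle2.readline()
        if not i and not j:
            return  # both files are finished
        yield (i or fill, j or fill)

# In-memory stand-ins for the two large files (assumed contents).
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\n")

pairs = list(paired_lines(file_a, file_b))
print(pairs)
```

A blank line inside a file is read as "\n", not "", so it does not end the loop early; only a true end of file returns the empty string.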
16
You can use izip_longest like this to pad the shorter file with empty lines.
In Python 2.6:
    from itertools import izip_longest
    with open('filea', 'r') as handle1:
        with open('fileb', 'r') as handle2:
            for i, j in izip_longest(handle1, handle2, fillvalue=""):
                ...
Or in Python 3+:
    from itertools import zip_longest
    with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
        for i, j in zip_longest(handle1, handle2, fillvalue=""):
            ...
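To see the padding in action, here is a small self-contained sketch for Python 3. The file contents and the tab-joined output format are assumptions made up for the demonstration, and `io.StringIO` objects replace the real files so it runs anywhere.

```python
import io
from itertools import zip_longest

# In-memory stand-ins for 'filea' and 'fileb' (assumed contents).
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\nb2\n")
output = io.StringIO()  # stands in for the output file

with file_a as handle1, file_b as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        # Strip the trailing newline and join the two lines with a tab.
        output.write(i.rstrip("\n") + "\t" + j.rstrip("\n") + "\n")

result = output.getvalue()
print(result)
```

The third output line has an empty second field, because fileb ran out first and zip_longest supplied the fill value.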
22
The itertools library has a function called izip that does what you want.
    from itertools import izip
    for i, j in izip(handle1, handle2):
        ...
If the files you are processing differ in length, you can use izip_longest, because izip stops at the end of the shorter file.
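Note that izip exists only in Python 2; in Python 3 the built-in zip is already lazy. The sketch below is one way to check that laziness on either version. The `numbered` generator and the `consumed` list are hypothetical names invented for the demonstration, standing in for the real file handles.

```python
try:
    from itertools import izip  # Python 2: lazy version of zip
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

consumed = []  # records every fake line that is actually produced

def numbered(label, count):
    """Yield `count` fake lines and record each one as it is produced."""
    for n in range(count):
        consumed.append((label, n))
        yield "%s%d\n" % (label, n)

pairs = izip(numbered("a", 1000000), numbered("b", 1000000))
first = next(pairs)  # pulls exactly one line from each source
print(first)
print(len(consumed))
```

Only two items are produced to build the first pair, so memory use does not grow with the size of the inputs.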