Inspecting and closing Python generators
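I have some code that reads data from two large files through generators and stops as soon as either file reaches EOF. I would like to know (1) which generator reached EOF first, (2) how far each generator had progressed when the first one reached EOF, i.e. the value of i inside each generator (see the code below), and (3) how many lines are left in the other generator. I don't know the length of either file and don't want to scan the files beforehand.
I know I can get the progress either by incrementing a counter on every call to next() (which seems ugly!) or by having the generator return a counter (see counter1 and counter2 in the code), but in both cases I still can't tell whether gen1 or gen2 reached EOF first.
I also found that I can attach a "message" to the StopIteration exception, but I wonder whether there is a better way. After the first try...except block, can I somehow find out which generator has not yet reached EOF and keep advancing it? (I tried calling close() or throw() on the generators, and using a finally clause inside the generator, but I didn't really understand them.)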
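def gen1(fp):
    for i, line in enumerate(fp):
        int_val = process_line(line)
        yield int_val, i
    # Python 2 raise syntax: attaches (name, last index) to the exception's args
    raise StopIteration, ("gen1", i)

def gen2(fp):
    for i, line in enumerate(fp):
        float_val = process_line_some_other_way(line)
        yield float_val, i
    # Python 2 raise syntax: attaches (name, last index) to the exception's args
    raise StopIteration, ("gen2", i)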
g1 = gen1(open('large_file', 'r'))
g2 = gen2(open('another_large_file', 'r'))

try:
    val1, counter1 = next(g1)
    val2, counter2 = next(g2)
    while True:  # actual code is a bit more complicated than shown here
        while val1 > val2:
            val2, counter2 = next(g2)
        while val1 < val2:
            val1, counter1 = next(g1)
        if val1 == val2:
            do_something()
            val1, counter1 = next(g1)
            val2, counter2 = next(g2)
except StopIteration as err:
    first_gen_name, first_num_lines = err.args
    gen1_finished_first = first_gen_name == 'gen1'
# Go through the rest of the other generator to get the total number of lines
the_remaining_generator = g2 if gen1_finished_first else g1
try:
    while True:
        next(the_remaining_generator)
except StopIteration as err:
    second_gen_name, second_num_lines = err.args

if gen1_finished_first:
    print 'gen1 finished first, it had {} lines.'.format(first_num_lines)  # same as `counter1`
    print 'gen2 was at line {} when gen1 finished.'.format(counter2)
    print 'gen2 had {} lines total.'.format(second_num_lines)
else:
    ...  # omitted
2 Answers
1
You can use chain to append a special EOF value to the end of your generator. For example:
from itertools import chain

EOF = object()  # unique sentinel object, so it is compared by identity
fin = open('somefile')
src = enumerate(chain(fin, [EOF]))
while True:
    idx, row = next(src)
    if row is EOF:
        break  # End of file
    print idx, row
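As a rough sketch of how the sentinel idea could be applied to the two-source loop in the question: the version below is written for Python 3, uses plain sorted lists in place of the file generators, and `merge_report` with its return layout is a made-up name for illustration, not something from the question. Because each wrapped source ends with a `(count, EOF)` pair, the index seen alongside `EOF` is the number of items that source produced.

```python
from itertools import chain

EOF = object()  # unique sentinel, always compared with `is`


def merge_report(values1, values2):
    """Walk two ascending iterables and report matches, who ended first,
    the positions at that moment, and the total length of each source."""
    src1 = enumerate(chain(values1, [EOF]))
    src2 = enumerate(chain(values2, [EOF]))
    matches = []
    i1, v1 = next(src1)
    i2, v2 = next(src2)
    # The sentinel check replaces the try/except StopIteration of the question.
    while v1 is not EOF and v2 is not EOF:
        if v1 < v2:
            i1, v1 = next(src1)
        elif v1 > v2:
            i2, v2 = next(src2)
        else:
            matches.append(v1)
            i1, v1 = next(src1)
            i2, v2 = next(src2)
    if v1 is EOF and v2 is EOF:
        finished_first = "both"
    else:
        finished_first = "src1" if v1 is EOF else "src2"
    # Drain whatever is left; the last index seen is that source's item count.
    total1, total2 = i1, i2
    for total1, _ in src1:
        pass
    for total2, _ in src2:
        pass
    return matches, finished_first, i1, i2, total1, total2


report = merge_report([1, 3, 5], [3, 4, 5, 6, 7])
```

Here `i1` and `i2` are the positions at the moment the first source ran out, and `total2 - i2` gives the number of items the other source still had left.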
You could also try izip_longest. Just replace f1 and f2 with your generators.
from itertools import count, izip_longest

EOF = object()
with open('f1') as f1, open('f2') as f2:
    for i, r1, r2 in izip_longest(count(), f1, f2, fillvalue=EOF):
        if EOF in (r1, r2):
            print i, r1, r2
            break
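On Python 3 the same function is called `zip_longest`. A minimal sketch of the lockstep approach under that assumption follows; `first_exhausted` is a hypothetical helper name, and lists stand in for the two generators.

```python
from itertools import count, zip_longest

EOF = object()  # fill value that marks an exhausted source


def first_exhausted(iter1, iter2):
    """Advance both iterables in lockstep and return
    (round index, iter1 ended?, iter2 ended?) for the first round
    in which at least one of them has run out."""
    # count() never ends, so the loop always reaches a round where
    # at least one of the finite inputs has been replaced by EOF.
    for i, r1, r2 in zip_longest(count(), iter1, iter2, fillvalue=EOF):
        if r1 is EOF or r2 is EOF:
            return i, r1 is EOF, r2 is EOF


result = first_exhausted(["a", "b"], ["x", "y", "z"])
```

Note that this only fits the case where both sources are advanced together on every step; the loop in the question advances them independently.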
2
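I think you may want an iterator class instead. It is implemented as an ordinary Python class and can carry whatever extra attributes you need (such as an exhausted flag).
Something like the following: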
# untested
class file_iter(object):
    def __init__(self, file_name):
        self.file = open(file_name)
        self.counted_lines = 0
        self.exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.exhausted:
            raise StopIteration
        # readline() returns an empty string at EOF; it never raises EOFError
        next_line = self.file.readline()
        if not next_line:
            self.file.close()
            self.exhausted = True
            raise StopIteration
        self.counted_lines += 1
        return next_line

    next = __next__  # Python 2 looks for a method named next()
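To show how such a flag might be used, here is a small Python 3 sketch that wraps any iterable rather than a file. `CountingIter` and its `count` attribute are illustrative names, and the lists stand in for the question's generators.

```python
class CountingIter:
    """Iterator wrapper that tracks how many items it has produced
    and whether the underlying iterable has run out."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0
        self.exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            item = next(self._it)
        except StopIteration:
            self.exhausted = True  # remember who ran out, then re-raise
            raise
        self.count += 1
        return item


a = CountingIter([1, 2])
b = CountingIter([10, 20, 30, 40])
try:
    while True:
        next(a)
        next(b)
except StopIteration:
    pass

# The flag set before re-raising tells us which wrapper ended the loop.
first = "a" if a.exhausted else "b"
count_a, count_b = a.count, b.count  # progress when the loop stopped
remaining_in_b = sum(1 for _ in b)   # drain b to count what was left
```

After the loop, the `exhausted` flags identify which source ended first, the `count` attributes give each source's progress, and the other source can simply be drained to count the rest.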