<p>A simple way to collect output from parallel md5sum subprocesses is to use a thread pool and write to the file from the main process:</p>
<pre><code>from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum(filename):
    try:
        return check_output(["md5sum", filename]), None
    except Exception as e:
        return None, e

if __name__ == "__main__":
    p = Pool(number_of_processes)  # specify number of concurrent processes
    with open("md5sums.txt", "wb") as logfile:
        for output, error in p.imap(md5sum, filenames):  # provide filenames
            if error is None:
                logfile.write(output)
</code></pre>
<ul>
<li>the output from <code>md5sum</code> is small, so it can be stored in memory</li>
<li><code>imap</code> preserves the order of the input</li>
<li><code>number_of_processes</code> may differ from the number of files or CPU cores (larger values do not necessarily mean faster: it depends on the relative performance of IO (disk) and CPU)</li>
</ul>
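<p>The order-preserving behavior of <code>imap</code> can be checked with a toy function (this snippet is illustrative, not part of the original answer):</p>
<pre><code>from multiprocessing.dummy import Pool  # threads, as above

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as p:
        # results come back in input order even if workers finish out of order
        print(list(p.imap(square, [3, 1, 2])))  # [9, 1, 4]
</code></pre>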
<p>You could try to pass several files at once to the md5sum subprocess.</p>
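<p>That batching idea could be sketched as follows (the <code>batches</code> helper and the batch size are hypothetical, not from the original code):</p>
<pre><code>from subprocess import check_output

def batches(seq, size):
    # hypothetical helper: split seq into chunks of at most `size` items
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def md5sum_batch(filenames):
    # one md5sum invocation hashes the whole batch;
    # it prints one "digest  name" line per file
    return check_output(["md5sum"] + list(filenames))
</code></pre>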
<p>An external subprocess is not necessary in this case; <a href="https://stackoverflow.com/a/4964420/4279">you can calculate md5 in Python</a>:</p>
<pre><code>import hashlib
from functools import partial

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()
</code></pre>
<p>To use multiple processes instead of threads (allowing the pure Python <code>md5sum()</code> to run in parallel on multiple CPUs), just remove <code>.dummy</code> from the import in the code above.</p>
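<p>Putting the two together, a process-based version might look like this (a sketch; the command-line usage is one possible way to feed it filenames):</p>
<pre><code>import hashlib
from functools import partial
from multiprocessing import Pool  # real processes: no .dummy

def md5sum(filename, chunksize=2**15):
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()

if __name__ == "__main__":
    import sys
    # hash the files given on the command line, in parallel
    with Pool() as p:
        for name, digest in zip(sys.argv[1:], p.imap(md5sum, sys.argv[1:])):
            print(digest, name)
</code></pre>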