Reading txt files with Python multithreading

数据工程师

Asked 2025-04-17 04:22

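I want to read files in Python (scan each one line by line and look for specific terms) and write out the results, for example a count for each term. I need to do this for a large number of files (more than 3,000). Can this be done with multithreading? If so, how?

The scenario is as follows:

Read each file and scan its lines
Write the counts for all the files I have read to the same output file.

My second question is whether this would improve the read/write speed.

I hope that is clear enough. Thanks,

Ron.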

2 Answers

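Yes, this can be done in parallel.

However, in Python it is hard to achieve real parallelism with multiple threads: in standard CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. For that reason the multiprocessing module is the better choice for parallel processing.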
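For comparison, this is roughly what a thread-based version looks like with the standard concurrent.futures module. It is only a sketch: count_lines and count_lines_threaded are hypothetical helpers, not part of the question, and the worker count of 8 is an arbitrary assumption. Threads mainly help while the program waits on disk I/O; the counting itself still runs one thread at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Hypothetical per-file task: return (path, number of lines)."""
    # errors="replace" tolerates files whose encoding is unknown or mixed.
    with open(path, "r", errors="replace") as inp:
        return path, sum(1 for _ in inp)


def count_lines_threaded(paths, max_workers=8):
    """Run count_lines on a pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(count_lines, paths))
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor from the same module gives a process-based variant with the same interface, provided the call is made under an `if __name__ == '__main__':` guard.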

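How much of a speedup you can expect is hard to say. It depends on what fraction of the workload can be done in parallel (the more the better) and what fraction has to be done serially (the less the better).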
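This trade-off between parallel and serial work is what Amdahl's law describes. The sketch below computes the theoretical upper bound on the speedup; amdahl_speedup is a hypothetical helper name, and the parallel fraction has to be estimated or measured for your own workload.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on the speedup when only part of the runtime is parallel.

    parallel_fraction: share of the serial runtime that can run in parallel (0..1).
    workers: number of processes or cores working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided among the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

For example, if half of the runtime can be parallelised, amdahl_speedup(0.5, 2) is about 1.33, and no number of workers can push the speedup past 2.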

Answered 2025-04-17 by Python大师

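I agree with @aix: multiprocessing is definitely the way to go. Whatever you do, you will be I/O bound: you can only read so fast, no matter how many parallel processes you have running. Still, there can easily be some speedup.

Consider the following example (input/ is a directory that contains several .txt files from Project Gutenberg).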

import os
import time
from multiprocessing import Pool


def process_file(name):
    '''Process one file: count the number of lines and words.'''
    linecount = 0
    wordcount = 0
    # errors='replace' tolerates files whose encoding is unknown or mixed.
    with open(name, 'r', errors='replace') as inp:
        for line in inp:
            linecount += 1
            # split() without arguments splits on any whitespace and
            # does not count empty strings as words.
            wordcount += len(line.split())

    return name, linecount, wordcount


def list_files(top):
    '''Collect the paths of all files below top (os.path.walk is gone in Python 3).'''
    return [os.path.join(dirname, name)
            for dirname, _, names in os.walk(top)
            for name in names]


def process_files_parallel(paths):
    '''Process the files in parallel via Pool.map().'''
    with Pool() as pool:
        return pool.map(process_file, paths)


def process_files(paths):
    '''Process the files one after another via map().'''
    # map() is lazy in Python 3, so list() forces the work to happen here.
    return list(map(process_file, paths))


if __name__ == '__main__':
    paths = list_files('input/')

    start = time.time()
    process_files(paths)
    print("process_files()", time.time() - start)

    start = time.time()
    process_files_parallel(paths)
    print("process_files_parallel()", time.time() - start)

When I run this on my dual-core machine, there is a noticeable speedup (though not a factor of two):

$ python process_files.py
process_files() 1.71218085289
process_files_parallel() 1.28905105591

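If the files are small enough to fit in memory, and you have a lot of processing to do that is not I/O bound, then you should see an even larger improvement.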
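The example above only counts lines and words. A sketch closer to the original question, counting specific words per file and writing everything to one output file, could look like the following. The word list TARGETS, the helper names and the output file counts.tsv are all assumptions made for illustration; matching is a plain lower-cased whitespace split, so punctuation attached to a word is not handled.

```python
import glob
import os
from collections import Counter
from multiprocessing import Pool

# Hypothetical word list; replace it with the words you are looking for.
TARGETS = ("whale", "ship", "sea")


def count_words(path, targets=TARGETS):
    """Return (path, Counter) with how often each target word occurs in one file."""
    wanted = set(targets)
    counts = Counter()
    with open(path, "r", errors="replace") as inp:
        for line in inp:
            # Lower-case the line and split on whitespace before matching.
            counts.update(word for word in line.lower().split() if word in wanted)
    return path, counts


def write_counts(results, out_path, targets=TARGETS):
    """Write one tab-separated row per input file to a single output file."""
    with open(out_path, "w") as out:
        out.write("file\t" + "\t".join(targets) + "\n")
        for path, counts in results:
            row = "\t".join(str(counts[t]) for t in targets)
            out.write(path + "\t" + row + "\n")


if __name__ == "__main__":
    # Assumed layout: input/ holds the .txt files; counts.tsv is the output.
    paths = sorted(glob.glob(os.path.join("input", "*.txt")))
    if paths:
        with Pool() as pool:
            results = pool.map(count_words, paths)
        write_counts(results, "counts.tsv")
```

Only the parent process writes the output file, so the workers never compete for it and no locking is needed.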

Answered 2025-04-17 by Python大师
