Parallel file parsing across multiple CPU cores

User

Reply 1 · edited 2024-04-19 04:33:53

Split the file into 8 smaller files.
Launch a separate script to process each file.
Join the results.

Why this is the best approach...

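It is simple: you do not have to program any differently than you would for linear (sequential) processing.
Launching a small number of long-running processes gives the best performance.
The operating system handles context switching and I/O multiplexing, so you do not have to worry about them (the OS does this well).
You can scale out to multiple machines without changing the code.
...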
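The three steps in this reply (split, launch one process per piece, join) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names split_file and run_parallel, the inline WORKER script, and its tab-separated output format are made up for this example and do not come from the original reply.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical per-file worker script: prints "key<line><TAB>value" per line.
WORKER = r"""
import sys
for line in open(sys.argv[1]):
    line = line.strip()
    if line:
        print('key' + line + '\tvalue')
"""

def split_file(path, n_parts, out_dir):
    """Write the lines of `path` into at most n_parts files of similar size."""
    lines = Path(path).read_text().splitlines(keepends=True)
    size = max(1, -(-len(lines) // n_parts))  # ceiling division
    parts = []
    for i in range(0, len(lines), size):
        part = Path(out_dir) / f"part{i // size}.txt"
        part.write_text("".join(lines[i:i + size]))
        parts.append(part)
    return parts

def run_parallel(path, n_parts=8):
    """Split the file, run one OS process per part, and join the results."""
    with tempfile.TemporaryDirectory() as tmp:
        parts = split_file(path, n_parts, tmp)
        # All processes are started before any output is read,
        # so they run concurrently.
        procs = [
            subprocess.Popen([sys.executable, "-c", WORKER, str(p)],
                             stdout=subprocess.PIPE, text=True)
            for p in parts
        ]
        result = {}
        for proc in procs:
            out, _ = proc.communicate()
            for row in out.splitlines():
                key, value = row.split("\t")
                result[key] = value
        return result
```

In practice the worker would be your existing single-file script rather than an inline string, and the same split/join structure carries over when the pieces are processed on different machines.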

User

Reply 2 · edited 2024-04-19 04:33:53

This can be done with Ray, a library for writing parallel and distributed Python.

To run the code below, first create input.txt as follows.

printf "1\n2\n3\n4\n5\n6\n" > input.txt

You can then process the file in parallel by adding the @ray.remote decorator to the parse function and executing many copies of it in parallel, as follows:

import ray
import time

ray.init()

@ray.remote
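def parse(line):
    # Simulate one second of parsing work per line.
    time.sleep(1)
    # Strip the trailing newline so the key is 'key1' rather than 'key1\n'.
    return 'key' + line.strip(), 'value'

# Submit all of the "parse" tasks in parallel and wait for the results.
with open('input.txt') as f:
    keys_and_values = ray.get([parse.remote(line) for line in f])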
# Create a dictionary out of the results.
result = dict(keys_and_values)

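Note that the best approach depends on how long the parse function takes to run. If it takes one second (as above), parsing one line per Ray task makes sense. If it takes 1 millisecond, it probably makes sense to parse a batch of lines (for example, 100) per Ray task.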
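Batching lines per Ray task might look something like the sketch below. The names batches, parse_batch and parse_file_with_ray are hypothetical helpers introduced for illustration, and parse here is a fast stand-in for the real function. Ray is imported inside the function only so that the pure-Python helpers can be tried without Ray installed.

```python
from itertools import islice

def batches(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def parse(line):
    # Stand-in for a fast (about 1 ms) parse function; an assumption.
    return 'key' + line.strip(), 'value'

def parse_batch(lines):
    """Parse a whole batch of lines inside a single task."""
    return [parse(line) for line in lines]

def parse_file_with_ray(path, batch_size=100):
    import ray  # lazy import: the helpers above do not need Ray
    ray.init(ignore_reinit_error=True)
    remote_parse_batch = ray.remote(parse_batch)
    with open(path) as f:
        # One Ray task per batch of lines instead of one per line.
        futures = [remote_parse_batch.remote(batch)
                   for batch in batches(f, batch_size)]
    results = ray.get(futures)
    # Flatten the per-batch lists of (key, value) pairs into one dict.
    return {k: v for batch in results for k, v in batch}
```

Larger batches mean less per-task scheduling overhead but coarser load balancing, so batch_size is worth tuning against how long parse really takes.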

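Your script is simple enough that the multiprocessing module would also work, but as soon as you want to do anything more complicated, or use multiple machines instead of one, Ray makes it much easier.

See the Ray documentation.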

User

Reply 3 · edited 2024-04-19 04:33:53

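CPython does not provide the threading model you are looking for. You can get a similar result with the multiprocessing module and a process pool.

Such a solution could look like this: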

import multiprocessing

def parse(line):
    """Placeholder parser (an assumption; substitute your own)."""
    return 'key' + line, 'value'

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = {}
    for line in lines:  # lines is a list of strings, not one string
        k, v = parse(line.rstrip('\n'))
        result[k] = v
    return result

if __name__ == '__main__':
    # configurable options.  different values may work better.
    numthreads = 8    # number of worker processes
    numlines = 100    # lines handed to a worker per task

    with open('input.txt') as f:
        lines = f.readlines()

    # create the process pool
    with multiprocessing.Pool(processes=numthreads) as pool:
        # map the list of line chunks into a list of result dicts
        result_list = pool.map(worker,
            [lines[i:i + numlines] for i in range(0, len(lines), numlines)])

    # reduce the result dicts into a single dict
    result = {}
    for partial in result_list:
        result.update(partial)
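As a possible variation on the manual chunking in this reply, Pool.map accepts a chunksize argument that groups items into batches before sending them to the workers. The sketch below assumes a toy parse function and a hypothetical wrapper named parse_file; neither comes from the original reply.

```python
import multiprocessing

def parse(line):
    """Hypothetical parser, mirroring the toy example from the Ray reply."""
    return 'key' + line.strip(), 'value'

def parse_file(path, processes=8, chunksize=100):
    """Parse every line of `path` in a process pool and return one dict."""
    with open(path) as f:
        lines = f.readlines()
    with multiprocessing.Pool(processes=processes) as pool:
        # chunksize asks the pool to hand lines to workers in groups,
        # which reduces inter-process overhead for cheap parse calls.
        pairs = pool.map(parse, lines, chunksize=chunksize)
    return dict(pairs)
```

Call parse_file('input.txt') from under an if __name__ == '__main__': guard, since platforms that start workers by spawning a fresh interpreter re-import the main module.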
