Reading files in parallel in Python

数据工程师

Asked 2025-04-16 18:52

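I have a bunch of files (about 100) containing data in this format:

(number of people) \t (average age)

These files were generated by running a random walk over a particular population. Each file has 100,000 lines, recording the average age for population sizes from 1 to 100,000. Each file corresponds to a different locality in a developing country. We will compare these values with the average ages of similarly sized localities in a developed country.

What I want to do is: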

for each i (i ranges from 1 to 100,000):
  Read in the first 'i' values of average-age
  perform some statistics on these values

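In other words, for each run i (with i ranging from 1 to 100,000), read in the first i average-age values, add them to a list, and run a few tests on them (such as the Kolmogorov-Smirnov test or the chi-square test).

To open all these files at the same time, I figured the best approach would be a dictionary of file objects. But I am stuck trying to carry out the operations above.

Is my approach optimal in terms of complexity?

Is there a better way?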

3 Answers

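I'm not sure I like this approach, but it might work for you. It can use a lot of memory, but it may do what you need. I'm assuming your data files are numbered sequentially. If they aren't, some adjustments will be needed.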

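# Open the files (assumes they are named file-1.txt through file-100.txt).
handles = [open('file-%d.txt' % i) for i in range(1, 101)]

try:
    # Loop once per line; each file is expected to have 100,000 lines.
    for line_number in range(100000):
        # One line from each file; readline() returns '' once a file runs out.
        lines = [fh.readline() for fh in handles]

        # Some sort of processing for the list of lines.
finally:
    # Close every handle, even if the processing raises.
    for fh in handles:
        fh.close()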

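This may get you close to what you need, but again, I'm not sure I like it. If the files don't all have the same number of lines, this approach could run into trouble.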
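If uneven line counts are a real concern, one possible way around them is `itertools.zip_longest`, which pads exhausted files with `None` instead of silently returning empty strings. This is only a sketch: the `io.StringIO` objects stand in for real file handles, and `parse_age` is a hypothetical helper, not something from the question.

```python
import io
from itertools import zip_longest

# Stand-ins for open file handles; real code would use open(path) instead.
handles = [
    io.StringIO("1\t10\n2\t11\n3\t12\n"),
    io.StringIO("1\t13\n2\t14\n"),  # deliberately one line shorter
]


def parse_age(line):
    """Return the average-age column as a float, or None for a missing line."""
    if line is None or not line.strip():
        return None
    _population, average_age = line.split("\t")
    return float(average_age)


# zip_longest advances all handles in lockstep and pads exhausted ones with None.
rows = [[parse_age(line) for line in lines] for lines in zip_longest(*handles)]
print(rows)
```

Each entry of `rows` holds one value per file for the same line number, with `None` marking a file that had already ended.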

Answered 2025-04-16 by Python大师

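Actually, 10 million lines (100 files of 100,000 lines each) can fit in memory.

You could build a dictionary whose keys are the number of people and whose values are lists of average ages, with each element of a list coming from a different file. So if you have 100 files, each list will have 100 elements.

That way, you don't need to keep file objects in a dict at all.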
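Roughly, that dictionary could be built like this. It's a minimal sketch that assumes every line is a tab-separated population and average age; the `io.StringIO` objects stand in for the real files, and `sources` and `ages_by_population` are just illustrative names.

```python
import io
from collections import defaultdict

# Stand-ins for the real files; with files on disk, open each path instead.
sources = [
    io.StringIO("1\t10\n2\t11\n3\t12\n"),
    io.StringIO("1\t13\n2\t14\n3\t15\n"),
]

# Maps population size -> list of average ages, one entry per file.
ages_by_population = defaultdict(list)

for source in sources:
    for line in source:
        population, average_age = line.split("\t")
        ages_by_population[int(population)].append(float(average_age))

print(dict(ages_by_population))
```

Only one file needs to be open at a time here, since each file is read completely before moving on to the next.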

Hope this helps.

Answered 2025-04-16 by Python大师

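Why not take a simple approach:

Open each file in turn, read its lines, and store the contents in an in-memory data structure.
Compute the statistics on that in-memory data structure.

Here is a complete example with 3 "files" of 3 lines each. For convenience it uses StringIO rather than actual files: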

#!/usr/bin/env python3

from io import StringIO

# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'

files = [f1, f2, f3]

# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []

for i, contents in enumerate(files):
    f = StringIO(contents)
    # with real files on disk, open each path instead:
    # f = open(filename, 'r')
    data.append(dict())

    for line in f:
        # each line is "<population>\t<average age>"
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age

print(data)

# gather custom statistics on the data

# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
pop2_avg = sum(data[loc][2] for loc in range(num_locations)) / num_locations
print('Average age with population 2 is', pop2_avg, 'years old')

The output is:

[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14.0 years old
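Once the values are in memory, the "first i values" loop from the question doesn't have to re-scan the prefix every time, at least for statistics that can be updated incrementally. Here is a sketch for a running mean using `itertools.accumulate`; the `ages` list is made-up sample data for a single location. Tests that need the whole sample, such as Kolmogorov-Smirnov, would presumably still be handed the slice `ages[:i]` on each pass.

```python
from itertools import accumulate

# Hypothetical average ages for one location, in line order.
ages = [10.0, 11.0, 12.0, 13.0]

# accumulate yields the running sum, so each prefix mean costs O(1)
# instead of re-summing the first i values on every iteration.
prefix_means = [
    total / i for i, total in enumerate(accumulate(ages), start=1)
]

print(prefix_means)
```

Here `prefix_means[i - 1]` is the mean of the first `i` values.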

Answered 2025-04-16 by Python大师
