Computing means from a CSV file with Python's NumPy
I have a 10 GB file (too large to fit in memory) in the following format:
Col1,Col2,Col3,Col4
1,2,3,4
34,256,348,
12,,3,4
The file has many columns and some missing values. I want to compute the means of columns 2 and 3. In plain Python I would do it like this:
def means(rng):
    s, e = rng
    with open("data.csv") as fd:
        title = next(fd)
        titles = title.split(',')
        print("Means for", ",".join(titles[s:e]))
        ret = [0] * (e - s)
        for c, l in enumerate(fd):
            vals = l.split(",")[s:e]
            for i, v in enumerate(vals):
                try:
                    ret[i] += int(v)
                except ValueError:
                    # missing value: leave the running sum unchanged
                    pass
    # every row counts in the divisor, so a missing value acts like 0
    return [float(total) / (c + 1) for total in ret]
But I suspect numpy offers a faster way to do this (I am still new to numpy).
2 Answers
2
Try this:
import numpy
# read columns 2 and 3 from the csv into a masked array
data = numpy.genfromtxt('test.csv', delimiter=',', usecols=(1,2), skip_header=1, usemask=True)
# calc means on columns (masked, i.e. missing, values are skipped)
ans = numpy.mean(data, axis=0)
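ans.data will contain an array with the means of all the selected columns.
Update, following the edit to the question
If you have a 10 GB file, you can also process it in chunks with numpy. See this answer.
Something like this: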
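As a quick check of how the masked mean treats missing fields, here is a minimal sketch that runs the same `genfromtxt` call on the four sample lines from the question. The in-memory `io.StringIO` buffer and the names `sample` and `col_means` are assumptions made for illustration; a real run would pass the file name instead.

```python
import io
import numpy

# The sample rows from the question, held in memory instead of a file.
sample = io.StringIO(
    "Col1,Col2,Col3,Col4\n"
    "1,2,3,4\n"
    "34,256,348,\n"
    "12,,3,4\n"
)

# usemask=True returns a masked array in which empty fields are masked.
data = numpy.genfromtxt(sample, delimiter=",", usecols=(1, 2),
                        skip_header=1, usemask=True)

# Masked entries are left out of both the sum and the count,
# so Col2 is averaged over two rows and Col3 over three.
col_means = numpy.mean(data, axis=0)
print(col_means.data)
```

Note that this differs from the plain-Python version in the question, which divides by the total number of rows and therefore treats a missing value as 0.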
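import itertools
import numpy

sums = numpy.zeros(2)
counts = numpy.zeros(2)
with open('test.csv') as fH:
    fH.readline()  # skip header
    while True:
        # read up to 1000 lines; stop once the file is exhausted
        lines = list(itertools.islice(fH, 1000))
        if not lines:
            break
        df = numpy.genfromtxt(lines, delimiter=',', usecols=(1,2), usemask=True)
        df = df.reshape(-1, 2)  # a one-line chunk comes back 1-D
        # add up the present values and count them, column by column
        sums = sums + df.filled(0).sum(axis=0)
        counts = counts + (~numpy.ma.getmaskarray(df)).sum(axis=0)
means = sums / counts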
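The chunked approach can be checked on the question's sample rows by forcing very small chunks. The sketch below wraps the loop in a function; `chunked_means`, `SAMPLE`, and the `chunk_size` parameter are hypothetical names introduced here, and the chunk is read into a list first so that an exhausted file is detected explicitly rather than through an exception from `genfromtxt`.

```python
import io
import itertools
import numpy

SAMPLE = "Col1,Col2,Col3,Col4\n1,2,3,4\n34,256,348,\n12,,3,4\n"

def chunked_means(lines, usecols, chunk_size=1000):
    """Per-column means over an iterator of CSV lines (header already consumed)."""
    ncols = len(usecols)
    sums = numpy.zeros(ncols)
    counts = numpy.zeros(ncols)
    while True:
        chunk = list(itertools.islice(lines, chunk_size))
        if not chunk:
            break
        arr = numpy.genfromtxt(chunk, delimiter=",", usecols=usecols, usemask=True)
        arr = arr.reshape(-1, ncols)            # single-line chunks are 1-D
        present = ~numpy.ma.getmaskarray(arr)   # True where a value exists
        sums += arr.filled(0.0).sum(axis=0)
        counts += present.sum(axis=0)
    # Dividing the global sums by the global counts keeps chunks of
    # different sizes (and different numbers of gaps) weighted correctly.
    return sums / counts

lines = io.StringIO(SAMPLE)
next(lines)  # skip header
print(chunked_means(lines, usecols=(1, 2), chunk_size=2))
```

Averaging the per-chunk means instead would give a different answer whenever chunks contain different numbers of valid values, which is why the sums and counts are accumulated separately.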
4
Pandas is your best friend:
import pandas as pd
# Load 10000 rows at a time, you can play with this number to get better
# performance on your machine
my_data = pd.read_csv("data.csv", chunksize=10000)
total = 0
count = 0
for chunk in my_data:
    # If you want to exclude NAs from the average, remove the next line
    chunk = chunk.fillna(0.0)
    total += chunk.sum(skipna=True)
    count += chunk.count()
avg = total / count
col1_avg = avg["Col1"]
# ... etc. ...
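Since the question only asks for columns 2 and 3, a variant that reads just those columns and leaves NAs out of the average might look like the sketch below. It runs on the question's sample rows through an in-memory buffer; the buffer, the tiny `chunksize=2`, and the variable names are assumptions for illustration only.

```python
import io
import pandas as pd

sample = io.StringIO(
    "Col1,Col2,Col3,Col4\n"
    "1,2,3,4\n"
    "34,256,348,\n"
    "12,,3,4\n"
)

total = 0
count = 0
# usecols limits parsing to the two columns of interest;
# chunksize=2 is only there to force several chunks on this tiny sample.
for chunk in pd.read_csv(sample, usecols=["Col2", "Col3"], chunksize=2):
    total = total + chunk.sum()    # NaN cells are skipped by default
    count = count + chunk.count()  # number of non-NaN cells per column

avg = total / count
print(avg["Col2"], avg["Col3"])
```

Restricting `usecols` also avoids parsing the remaining columns, which should help on a file with many columns.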