Python: help averaging table data
OK, I have a program that works. It opens a data file whose values are arranged in columns (the file is too large for Excel) and computes the average of each column:
Sample data:
Joe Sam Bob
1 2 3
2 1 3
and it returns:
Joe Sam Bob
1.5 1.5 3
That works fine. The problem is that some columns contain NA values. I want to skip those NAs and average the remaining values. So
Bobby
1
NA
2
should output:
Bobby
1.5
Here is my existing program, which I got some help with here earlier. Any help is greatly appreciated!
with open('C://avy.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    for line in f:
        # Skip empty lines
        if not line.strip():
            continue
        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1
    with open('c://finished.txt', 'w') as ouf:
        for index, summedRowValue in enumerate(sums):
            print>>ouf, columns[index], 1.0 * summedRowValue / numRows
Now I have this:
with open('C://avy.txt', "rtU") as f:
    def get_averages(f):
        headers = f.readline().split()
        ncols = len(headers)
        sumx0 = [0] * ncols
        sumx1 = [0.0] * ncols
        lino = 1
    for line in f:
        lino += 1
        values = line.split()
        for colindex, x in enumerate(values):
            if colindex >= ncols:
                print >> sys.stderr, "Extra data %r in row %d, column %d" %(x, lino, colindex+1)
                continue
            try:
                value = float(x)
            except ValueError:
                continue
            sumx0[colindex] += 1
            sumx1[colindex] += value
    print headers
    print sumx1
    print sumx0
    averages = [
        total / count if count else None
        for total, count in zip(sumx1, sumx0)
        ]
    print averages
and it shows:
Traceback (most recent call last):
  File "C:/avy10.py", line 11, in <module>
    lino += 1
NameError: name 'lino' is not defined
5 Answers
[Edited for clarity]
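When you read items from a text file, they come in as strings, not numbers. This means that if your text file contains the number 3, Python reads it as the string '3', and you have to convert that string to a number before you can do arithmetic with it.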
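Now, you have a text file with several columns. Each column has a header and a series of items. Each item is either a number or not. If it is a number, the float function converts it correctly; if it is not a valid number (that is, it cannot be converted), the conversion raises an exception called ValueError.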
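So you need to loop over your lines and items, as several of the earlier answers explained. If an item can be converted to a float, add it to the running total; if it cannot, skip that entry.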
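As a minimal sketch of that idea (written in Python 3 syntax; the helper name `mean_of_numeric` is made up for illustration), the conversion and the skip can be combined like this:

```python
def mean_of_numeric(items):
    """Average the items that convert to float; skip the rest (e.g. 'NA')."""
    total = 0.0
    count = 0
    for item in items:
        try:
            value = float(item)   # '3' -> 3.0; 'NA' raises ValueError
        except ValueError:
            continue              # not a number: skip this entry
        total += value
        count += 1
    # Return None when nothing could be converted, to avoid dividing by zero
    return total / count if count else None


print(mean_of_numeric(["1", "NA", "2"]))  # prints 1.5
```

One thing to keep in mind: float also accepts strings such as 'nan' and 'inf', so if those can appear in your data you may want to filter them out separately.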
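If you want to read up on "duck typing" (a paradigm that can be boiled down to "it is better to ask forgiveness than permission"), see the Wikipedia article on it. You will hear the term a lot once you start learning Python.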
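Below is a class that can be used to accumulate statistics (the one you care about is the mean). You can use one instance of this class for each column of your table.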
class Accumulator(object):
    """
    Used to accumulate the arithmetic mean of a stream of
    numbers. This implementation does not allow removing items
    already accumulated, but it could easily be modified to do
    so. Also, other statistics could be accumulated.
    """
    def __init__(self):
        # upon initialization, the number of items currently
        # accumulated (_n) and the total sum of the items accumulated
        # (_sum) are set to zero because nothing has been accumulated
        # yet.
        self._n = 0
        self._sum = 0.0

    def add(self, item):
        # 'add' is used to add an item to this accumulator
        try:
            # try to convert the item to a float. If you are
            # successful, add the float to the current sum and
            # increase the number of accumulated items
            self._sum += float(item)
            self._n += 1
        except ValueError:
            # if you fail to convert the item to a float, simply
            # ignore the exception (pass on it and do nothing)
            pass

    @property
    def mean(self):
        # the property 'mean' returns the current mean accumulated in
        # the object
        if self._n > 0:
            # if you have more than zero items accumulated, then return
            # their arithmetic average
            return self._sum / self._n
        else:
            # if you have no items accumulated, return None (you could
            # also raise an exception)
            return None

# using the object:
# Create an instance of the object "Accumulator"
my_accumulator = Accumulator()
print my_accumulator.mean
# prints None because there are no items accumulated
# add one (a number)
my_accumulator.add(1)
print my_accumulator.mean
# prints 1.0
# add two (a string - it will be converted to a float)
my_accumulator.add('2')
print my_accumulator.mean
# prints 1.5
# add a 'NA' (will be ignored because it cannot be converted to float)
my_accumulator.add('NA')
print my_accumulator.mean
# prints 1.5 (notice that it ignored the 'NA')
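To apply this to a whole table, one instance per column can be kept in a list. The sketch below uses Python 3 syntax and repeats a condensed version of the class so that it runs on its own; the sample text and the helper name `column_means` are assumptions for illustration, not part of your program.

```python
import io


class Accumulator:
    """Condensed version of the class above: running mean of numeric items."""

    def __init__(self):
        self._n = 0
        self._sum = 0.0

    def add(self, item):
        try:
            self._sum += float(item)
            self._n += 1
        except ValueError:
            pass  # non-numeric entries such as 'NA' are ignored

    @property
    def mean(self):
        return self._sum / self._n if self._n > 0 else None


def column_means(f):
    """Read a whitespace-separated table from file object f.

    Returns (headers, means), with one mean per column.
    """
    headers = f.readline().split()
    accumulators = [Accumulator() for _ in headers]  # one per column
    for line in f:
        # zip stops at the shorter sequence, so fields beyond the last
        # header are silently dropped and blank lines add nothing
        for acc, item in zip(accumulators, line.split()):
            acc.add(item)
    return headers, [acc.mean for acc in accumulators]


sample = io.StringIO("Joe Sam Bobby\n1 2 1\n2 1 NA\n3 3 2\n")
headers, means = column_means(sample)
for name, mean in zip(headers, means):
    print(name, mean)
# prints:
# Joe 2.0
# Sam 2.0
# Bobby 1.5
```

With a real file you would pass the object returned by open() instead of the io.StringIO sample.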
Cheers.
Here is a working solution:
text = """Joe Sam Bob
1 2 3
2 1 3
NA 2 3
3 5 NA"""

def avg( lst ):
    """ returns the average of a list """
    return 1. * sum(lst)/len(lst)

# split that text
parts = [line.split() for line in text.splitlines()]
# remove the headers
names = parts.pop(0)
# zip(*m) does something like transpose a matrix :-)
columns = zip(*parts)
# convert to numbers and leave out the NA
numbers = [[int(x) for x in column if x != 'NA' ] for column in columns]
# all that is left is averaging
averages = [avg(col) for col in numbers]
# and printing
for name, x in zip( names, averages):
    print name, x
I used a lot of list comprehensions here so that you can print the intermediate steps, but the same thing could be done with generators.
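A generator-based variant might look like the sketch below. It is written in Python 3 syntax, and the one-pass `avg` is a rewrite rather than the function above, since a generator has no len(). As an assumption about the desired behaviour, it returns None for a column that contains only NA instead of dividing by zero.

```python
text = """Joe Sam Bob
1 2 3
2 1 3
NA 2 3
3 5 NA"""


def avg(values):
    """Average any iterable in one pass; None if it yields nothing."""
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    return total / count if count else None


lines = iter(text.splitlines())
names = next(lines).split()                  # header row
rows = (line.split() for line in lines)      # generator of rows
columns = zip(*rows)                         # note: unpacking still reads every row
# generator of generators: numeric values per column, 'NA' left out
numbers = ((int(x) for x in column if x != 'NA') for column in columns)
averages = [avg(col) for col in numbers]

for name, x in zip(names, averages):
    print(name, x)
# prints:
# Joe 2.0
# Sam 2.5
# Bob 3.0
```

Because zip(*rows) has to see every row before it can produce the first column, this does not save memory on a very large file; for that, running sums per column are the better fit.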
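The code below copes with rows that have differing numbers of values and also detects extra data... in other words, it is fairly robust. It could be improved by adding explicit messages (1) if the file is empty and (2) if the header is empty. Another possibility is to test explicitly for "NA" and issue an error message if a field is neither "NA" nor convertible to a number.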
>>> import sys, StringIO
>>>
>>> data = """\
... Jim Joe Billy Bob
... 1 2 3 x
... 2 x x x 666
...
... 3 4 5 x
... """
>>>
>>> def get_averages(f):
...     headers = f.readline().split()
...     ncols = len(headers)
...     sumx0 = [0] * ncols
...     sumx1 = [0.0] * ncols
...     lino = 1
...     for line in f:
...         lino += 1
...         values = line.split()
...         for colindex, x in enumerate(values):
...             if colindex >= ncols:
...                 print >> sys.stderr, "Extra data %r in row %d, column %d" % (x, lino, colindex+1)
...                 continue
...             try:
...                 value = float(x)
...             except ValueError:
...                 continue
...             sumx0[colindex] += 1
...             sumx1[colindex] += value
...     print headers
...     print sumx1
...     print sumx0
...     averages = [
...         total / count if count else None
...         for total, count in zip(sumx1, sumx0)
...         ]
...     print averages
...     # Edit: added here:
...     return headers, averages
...
>>> sio = StringIO.StringIO(data)
>>> get_averages(sio)
Extra data '666' in row 3, column 5
['Jim', 'Joe', 'Billy', 'Bob']
[6.0, 6.0, 8.0, 0.0]
[3, 2, 2, 0]
[2.0, 3.0, 4.0, None]
>>>
Edit
Normal usage:
with open('myfile.text') as mf:
    hdrs, avgs = get_averages(mf)
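The improvements suggested above (explicit messages for an empty file or an empty header, and an explicit "NA" test) could be sketched roughly as follows. This is written in Python 3 syntax; the name `get_averages_strict` and the `na` parameter are hypothetical, and raising ValueError instead of printing to stderr is an assumption about how you would want errors reported.

```python
import io


def get_averages_strict(f, na="NA"):
    """Like get_averages, but rejects fields that are neither `na` nor numeric."""
    header_line = f.readline()
    if not header_line:
        raise ValueError("file is empty")
    headers = header_line.split()
    if not headers:
        raise ValueError("header line is empty")
    ncols = len(headers)
    counts = [0] * ncols
    totals = [0.0] * ncols
    for lino, line in enumerate(f, start=2):    # the header was line 1
        for colindex, x in enumerate(line.split()):
            if colindex >= ncols:
                raise ValueError("extra data %r in row %d, column %d"
                                 % (x, lino, colindex + 1))
            if x == na:
                continue                        # missing value: skip it
            try:
                value = float(x)
            except ValueError:
                raise ValueError("bad value %r in row %d, column %d"
                                 % (x, lino, colindex + 1))
            counts[colindex] += 1
            totals[colindex] += value
    averages = [t / c if c else None for t, c in zip(totals, counts)]
    return headers, averages


hdrs, avgs = get_averages_strict(io.StringIO("Jim Joe\n1 NA\n3 4\n"))
print(hdrs, avgs)  # prints ['Jim', 'Joe'] [2.0, 4.0]
```

Blank lines are still skipped, because line.split() returns an empty list for them.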