np.average does not work with missing data when used with pandas groupby
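I am running into a problem computing weighted averages with pandas groupby and numpy's np.average. The problem seems to be caused by missing values in the data (that is, the data is missing, not the weights). I have put together a conceptual example below.
The behavior I want is that when a data value is missing, the weight for that record is ignored as well. Simply dropping the row is not an option, because the other data columns in that row still hold values. I thought np.ma.average would do exactly this, but it still returns NaN.
Any suggestions?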
import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': ['a', 'a', 'b', 'a', 'b', 'b'],
                   'data': [3, 3, 4, 2, 2.5, np.nan],
                   'Weights': [1, 2, 1, 3, 1, 3]})

def wavg(subdf):
    series = pd.Series(dtype=float)
    for column in df.columns:
        series['np.mean'] = np.mean(subdf['data'])
        series['np.average (no weights)'] = np.average(subdf['data'])
        series['np.average (weighted)'] = np.average(subdf['data'], weights=subdf['Weights'])
        series['np.ma.average (weighted)'] = np.ma.average(subdf['data'], weights=subdf['Weights'])
    return series

df.groupby('groups').apply(wavg)
This gives me:
        np.mean  np.average (no weights)  np.average (weighted)  np.ma.average (weighted)
groups
a      2.666667                 2.666667                    2.5                       2.5
b      3.250000                      NaN                    NaN                       NaN
====================================
For the curious, this is what I ended up using:
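def wavg(subdf):
    series = pd.Series(dtype=float)
    # `columns` is not defined in this snippet; it is assumed to be the list
    # of data column names to average, defined elsewhere.
    for column in columns:
        # Drop only the rows where this particular column is missing.
        valid = subdf.dropna(subset=[column])
        if len(valid) == 0:
            series[str(column)] = np.nan
        else:
            series[str(column)] = np.average(valid[column], weights=valid['Weights'])
    return series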
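For comparison, the same per-column result can probably be computed without apply. This is only a sketch, not the code I actually used: weighted_group_means and its parameter names are hypothetical, and it assumes the weights themselves are never missing.

```python
import numpy as np
import pandas as pd

def weighted_group_means(frame, value_cols, weight_col, group_col):
    """Weighted mean per group and column, ignoring rows whose value is NaN.

    Hypothetical helper for illustration; assumes weights are never NaN.
    """
    weights = frame[weight_col]
    keys = frame[group_col]
    # Numerator: sum of value * weight; sum() skips NaN products by default.
    numerator = frame[value_cols].mul(weights, axis=0).groupby(keys).sum()
    # Denominator: sum of weights, counting only rows where the value is present.
    present = frame[value_cols].notna().astype(float)
    denominator = present.mul(weights, axis=0).groupby(keys).sum()
    # A group with no valid values gives 0 / 0, which pandas reports as NaN.
    return numerator / denominator

df = pd.DataFrame({'groups': ['a', 'a', 'b', 'a', 'b', 'b'],
                   'data': [3, 3, 4, 2, 2.5, np.nan],
                   'Weights': [1, 2, 1, 3, 1, 3]})

result = weighted_group_means(df, ['data'], 'Weights', 'groups')
print(result)  # group a -> 2.5, group b -> 3.25
```

Because the mask is built per column, a row that is missing in one column still contributes its weight to the other columns.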
1 Answer
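np.average does not handle NaN (missing values) on its own, so you have to deal with them yourself. The simplest way is to filter subdf before doing anything else with it: add subdf = subdf.dropna(subset=['data']) at the top of wavg, which removes the rows whose "data" value is missing.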
def wavg(subdf):
    series = pd.Series(dtype=float)
    # Drop rows with missing data first so their weights are ignored too.
    subdf = subdf.dropna(subset=['data'])
    series['np.mean'] = np.mean(subdf['data'])
    series['np.average (no weights)'] = np.average(subdf['data'])
    series['np.average (weighted)'] = np.average(subdf['data'], weights=subdf['Weights'])
    series['np.ma.average (weighted)'] = np.ma.average(subdf['data'], weights=subdf['Weights'])
    return series
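As I mentioned in the comments, I removed the loop from wavg. You only need to return one set of averages per group (one mean, one unweighted average, one weighted average, and one masked average), but the loop recomputes exactly the same values once for every column of your DataFrame.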
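As a side note on why np.ma.average returned NaN in the question: a pandas column containing NaN is not a masked array, so there is nothing for it to skip. A minimal sketch of the masked route, using the group 'b' values from the question and np.ma.masked_invalid to build the mask explicitly (the variable names are just for illustration):

```python
import numpy as np

data = np.array([4.0, 2.5, np.nan])   # group 'b' data from the question
weights = np.array([1, 1, 3])         # matching weights

# A plain array carries no mask, so the NaN propagates into the result.
unmasked = np.ma.average(data, weights=weights)

# masked_invalid masks NaN/inf entries; np.ma.average then ignores both
# the masked values and their weights.
masked = np.ma.average(np.ma.masked_invalid(data), weights=weights)

print(unmasked)  # nan
print(masked)    # 3.25
```

Inside wavg this would mean passing np.ma.masked_invalid(subdf['data'].to_numpy()) instead of the raw column. The dropna version above is arguably easier to read, so treat this as an alternative rather than a recommendation.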