Computing the mean within bins of a second variable

5 votes

2 answers

4065 views

Asked 2025-04-17 19:30

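I am working with Python and NumPy. My input data consists of many (x, y) pairs. I basically want to plot <y>(x), i.e. the mean of y within each bin of x. At the moment I do this with a plain for loop, which is very slow.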

import numpy
# create example data
x = numpy.random.rand(1000)
y = numpy.random.rand(1000)
# set resolution
xbins = 100
# find x bin edges (only xedges is used from the 2D histogram)
H, xedges, yedges = numpy.histogram2d(x, y, bins=(xbins, xbins))
# calculate mean and std of y for each x bin
mean = numpy.zeros(xbins)
std = numpy.zeros(xbins)
for i in range(xbins):
    # boolean mask selecting the points whose x falls in bin i
    in_bin = numpy.logical_and(x >= xedges[i], x < xedges[i+1])
    mean[i] = numpy.mean(y[in_bin])
    std[i] = numpy.std(y[in_bin])

Is there a more efficient way to do this?


2 Answers

If you can use the pandas library:

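import numpy as np
import pandas as pd
# x, y and xbins are defined as in the question
xedges = np.linspace(x.min(), x.max(), xbins + 1)
# integer bin index per point; include_lowest=True keeps x.min() in the first bin
c = pd.cut(x, xedges, labels=False, include_lowest=True)
g = pd.Series(y).groupby(c)
mean2 = g.mean()
std2 = g.std(ddof=0)  # ddof=0 matches the default of numpy.std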
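One thing worth noting about the groupby approach is that bins containing no points simply do not appear in the result, so the output can be shorter than xbins. The following is a minimal sketch, not part of the original answer, that reindexes the result so every bin has an entry (NaN when empty) and computes bin centers for plotting. The seeded generator, the `reindex` step and the names `codes`, `grouped` and `centers` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Example data as in the question; the fixed seed is only for reproducibility.
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)
xbins = 100

# Equal-width edges spanning the data range.
xedges = np.linspace(x.min(), x.max(), xbins + 1)

# Integer bin index (0 .. xbins-1) for every point. Bins are right-closed,
# and include_lowest=True puts x.min() into the first bin.
codes = pd.cut(x, xedges, labels=False, include_lowest=True)

grouped = pd.Series(y).groupby(codes)

# Reindex so that empty bins show up as NaN instead of being dropped.
mean2 = grouped.mean().reindex(range(xbins)).to_numpy()
std2 = grouped.std(ddof=0).reindex(range(xbins)).to_numpy()

# Bin centers, convenient as the x axis when plotting <y>(x).
centers = 0.5 * (xedges[:-1] + xedges[1:])
```

With this layout, `centers`, `mean2` and `std2` all have length xbins and can be passed directly to a plotting call such as an error-bar plot.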

Answered 2025-04-17 by Python大师


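You are overcomplicating this. For each bin in x you only need three values: n, sy and sy2. Here n is the number of y values that fall in that x bin, sy is the sum of those y values, and sy2 is the sum of their squares. You can get them like this: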

>>> import numpy as np
>>> n, _ = np.histogram(x, bins=xbins)
>>> sy, _ = np.histogram(x, bins=xbins, weights=y)
>>> sy2, _ = np.histogram(x, bins=xbins, weights=y*y)

From these values:

>>> mean = sy / n
>>> std = np.sqrt(sy2/n - mean*mean)
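The formula for std relies on the identity Var(y) = E[y²] − (E[y])², which gives the population standard deviation (the same convention as the default of numpy.std). Two edge cases are worth guarding in practice: an empty bin has n = 0, and rounding can push the variance slightly below zero. The following is a minimal sketch of the same weighted-histogram technique with those guards added. The function name `binned_mean_std` and the seeded example data are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def binned_mean_std(x, y, bins):
    """Mean and population std of y within equal-width bins of x.

    Hypothetical helper; returns (edges, mean, std). Empty bins yield NaN.
    """
    n, edges = np.histogram(x, bins=bins)               # count per bin
    sy, _ = np.histogram(x, bins=bins, weights=y)       # sum of y per bin
    sy2, _ = np.histogram(x, bins=bins, weights=y * y)  # sum of y**2 per bin

    # 0/0 in empty bins produces NaN; silence the warning for that case.
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = sy / n
        var = sy2 / n - mean * mean

    # Rounding can make var slightly negative; np.maximum keeps NaN as NaN.
    std = np.sqrt(np.maximum(var, 0.0))
    return edges, mean, std

# Example data as in the question; the fixed seed is only for reproducibility.
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)
xbins = 100

edges, mean, std = binned_mean_std(x, y, xbins)
```

Because all three histograms are computed over the same x with the same bin count, they share identical edges, so the per-bin arrays line up element by element.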

Answered 2025-04-17 by Python大师

