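Calculating the mean value per bin with Python's histogram2d
How do I calculate the mean value in each bin of a 2D histogram in Python? I have temperature ranges on the x and y axes, and I want to plot the probability of lightning using those temperature bins. I read the data from a csv file; here is my code: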
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

filename = 'Random_Events_All_Sorted_85GHz.csv'
df = pd.read_csv(filename)
min37 = df.min37
min85 = df.min85
verification = df.five_min_1
#Numbers
x = min85
y = min37
H = verification
#Estimate the 2D histogram
nbins = 4
H, xedges, yedges = np.histogram2d(x,y,bins=nbins)
#Rotate and flip H
H = np.rot90(H)
H = np.flipud(H)
#Mask zeros
Hmasked = np.ma.masked_where(H==0,H)
#Plot 2D histogram using pcolor
fig1 = plt.figure()
plt.pcolormesh(xedges,yedges,Hmasked)
plt.xlabel('min 85 GHz PCT (K)')
plt.ylabel('min 37 GHz PCT (K)')
cbar = plt.colorbar()
cbar.ax.set_ylabel('Probability of Lightning (%)')
plt.show()
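This code produces a decent-looking plot, but what it shows is the number of samples in each bin, i.e. the counts. The verification variable is an array of 1s and 0s, where 1 means lightning and 0 means no lightning. I want the plot to show the probability of lightning in each bin, so I need the per-bin mean of verification expressed as a percentage, i.e. bin_mean*100.
I tried an approach similar to the one shown here (binning data in python with scipy/numpy), but I ran into trouble getting it to work for a 2D histogram.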
2 Answers
8
There is a simple and fast way to do this! Use the weights parameter to sum the values in each bin:
denominator, xedges, yedges = np.histogram2d(x,y,bins=nbins)
nominator, _, _ = np.histogram2d(x,y,bins=[xedges, yedges], weights=verification)
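So all you need to do is divide the sum of the values in each bin by the number of events in that bin (clip(1) keeps empty bins from dividing by zero, so they come out as 0):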
result = nominator / denominator.clip(1)
Done!
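For reference, here is a minimal self-contained sketch of the weights approach. The helper name bin_mean_2d and the synthetic data are my own assumptions for illustration, not part of the original question. Unlike clip(1), this version marks empty bins as NaN, so a bin with no samples can be told apart from a bin with 0% lightning.

```python
import numpy as np

def bin_mean_2d(x, y, values, bins):
    """Mean of `values` in each 2D bin; NaN where a bin has no samples.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Number of samples per bin
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Sum of `values` per bin, reusing the same edges
    sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=values)
    # Divide only where there is data; empty bins become NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = np.where(counts > 0, sums / counts, np.nan)
    return mean, xedges, yedges

# Synthetic stand-in for the csv columns (assumed values, not real data)
rng = np.random.default_rng(0)
x = rng.uniform(150.0, 300.0, size=1000)   # plays the role of min85
y = rng.uniform(200.0, 300.0, size=1000)   # plays the role of min37
flag = rng.integers(0, 2, size=1000)       # plays the role of verification (0/1)

prob, xedges, yedges = bin_mean_2d(x, y, flag, bins=4)
percent = 100.0 * prob  # probability of lightning in percent, indexed [x, y]
```

To plot it, transpose so that rows correspond to y, and mask the NaNs: plt.pcolormesh(xedges, yedges, np.ma.masked_invalid(percent).T).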
1
This is doable, at least in the following way.
# xedges, yedges as returned by 'histogram2d'
# create an array for the output quantities
avgarr = np.zeros((nbins, nbins))
# determine the X and Y bins each sample coordinate belongs to
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])
# calculate the bin sums (note: if you have very many samples, using
# 'bincount' is more efficient, but it requires some index arithmetic)
for xb, yb, v in zip(xbins, ybins, verification):
    avgarr[yb, xb] += v
# replace 0s in H by NaNs (remove divide-by-zero complaints)
# if you do not have any further use for H after plotting, the
# copy operation is unnecessary, and this will then also take care
# of the masking (NaNs are plotted transparent)
divisor = H.copy()
divisor[divisor==0.0] = np.nan
# calculate the average
avgarr /= divisor
# now 'avgarr' contains the averages (NaNs for no-sample bins)
If you know the bin edges in advance, you can do the histogram part in the same way with just one extra line.
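The 'bincount' variant mentioned in the code comment could look roughly like the sketch below. The function name bin_mean_2d_bincount is hypothetical, and the code assumes the edges are given (for example as returned by histogram2d on the same data). One caveat: np.digitize with the interior edges puts out-of-range samples into the outermost bins, whereas histogram2d drops them, so the two only agree when every sample lies within the edges.

```python
import numpy as np

def bin_mean_2d_bincount(x, y, values, xedges, yedges):
    """Per-bin mean via np.bincount; result is indexed [ybin, xbin].

    Hypothetical helper for illustration; empty bins come out as NaN.
    """
    nx = len(xedges) - 1
    ny = len(yedges) - 1
    # Bin index of each sample along each axis, in 0 .. n-1
    xb = np.digitize(x, xedges[1:-1])
    yb = np.digitize(y, yedges[1:-1])
    # The "index arithmetic": fold (yb, xb) into one flat index
    flat = yb * nx + xb
    sums = np.bincount(flat, weights=np.asarray(values, dtype=float),
                       minlength=nx * ny)
    counts = np.bincount(flat, minlength=nx * ny)
    # 0/0 in empty bins yields NaN; silence the warning
    with np.errstate(invalid="ignore"):
        avg = sums / counts
    return avg.reshape(ny, nx)

# Tiny example with assumed values
xs = np.array([0.1, 0.2, 0.9, 0.8])
ys = np.array([0.1, 0.2, 0.9, 0.8])
vs = np.array([1, 0, 1, 1])
edges = np.array([0.1, 0.5, 0.9])
avg = bin_mean_2d_bincount(xs, ys, vs, edges, edges)
```

Since the result is already laid out as [ybin, xbin], it can be passed to pcolormesh without the rotate-and-flip step.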