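Can numpy's digitize function output the mean or median?
I have some data that I need to group into bins. Normally the bins are labelled 0, 1, 2, 3 and so on, but what I want is the mean or median of each bin. Is there a way to do this?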
4 Answers
1
Here is a slightly simpler and more general version of unutbu's code:
import numpy as np

x = np.tile(np.array([0.2, 9., 6.4, 3.0, 1.6]), 100000)
bins = np.array([0.0, 1.0, 2.5, 10.0])

def binstats(x, bins, funcs):
    # Assign every value to a bin, then collect the bins that are actually occupied.
    inds = np.digitize(x, bins)
    inds2 = np.unique(inds)
    statistics = []
    for bin_idx in inds2:
        # Select the values in this bin and apply each requested function to them.
        bin_arr = x[inds == bin_idx]
        statistics.append([f(bin_arr) for f in funcs])
    return statistics, inds2

statistics, binnumber = binstats(x, bins, [np.mean, np.median])
print(statistics)
for (mean, median), bin_idx in zip(statistics, binnumber):
    print('{b}: {mean:.2f} {median:.2f}'.format(b=bin_idx, mean=mean, median=median))
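This version may be preferable because it lets you pass any number of functions to compute. Building the set of occupied bins up front with np.unique may also be faster.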
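The question asks for each value's bin index to be replaced by that bin's mean or median. A minimal sketch of one way to do that on top of the approach above, assuming the `return_inverse` option of `np.unique` is used to map each element back to its bin (the variable names `uniq`, `inverse` and `per_element_median` are illustrative, not from the answer):

```python
import numpy as np

x = np.array([0.2, 9.0, 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 10.0])

inds = np.digitize(x, bins)

# uniq holds the occupied bin indices; inverse[i] is the position in uniq
# of the bin that x[i] falls into.
uniq, inverse = np.unique(inds, return_inverse=True)

# One median per occupied bin (still a loop, but only over bins).
bin_medians = np.array([np.median(x[inds == b]) for b in uniq])

# Broadcast the per-bin medians back onto the original elements.
per_element_median = bin_medians[inverse]
print(per_element_median)
```

The loop runs once per occupied bin rather than once per element, so it stays cheap when there are few bins.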
4
For comparison, here is how you would write this in pandas, using groupby and pd.cut (which is similar to np.digitize):
>>> import numpy as np
>>> import pandas as pd
>>> x = np.random.uniform(0, 10, 5*10**5)
>>> bins = np.array([0, 1, 2.5, 10])
>>> s = pd.Series(x)
>>> s.groupby(pd.cut(s, bins)).agg(["median", "mean"])
             median      mean
(0, 1]     0.500684  0.500641
(1, 2.5]   1.751121  1.751630
(2.5, 10]  6.243822  6.248801

[3 rows x 2 columns]
Performance seems to be about the same as unutbu's numpy solutions (after tweaking them slightly to accept arguments):
>>> %timeit binstats(x, bins)
10 loops, best of 3: 126 ms per loop
>>> %timeit onecall(x, bins)
10 loops, best of 3: 74.8 ms per loop
>>> %timeit twocalls(x, bins)
10 loops, best of 3: 109 ms per loop
>>> %timeit s.groupby(pd.cut(s, bins)).agg(["median", "mean"])
10 loops, best of 3: 72.5 ms per loop
If you are willing to sacrifice a little elegance, you can save a bit more time:
>>> %timeit s.groupby(np.digitize(x, bins)).agg(["median", "mean"])
10 loops, best of 3: 65.2 ms per loop
That said, I don't use pandas for performance. I use it because it makes many common data manipulations much more convenient.
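If the goal is a per-element result (each value replaced by its bin's median) rather than a summary table, a groupby `transform` is one option. The sketch below uses a small fixed array instead of the random data above so the result is easy to check, and passes `observed=True` explicitly, which is an assumption about how you want empty categories handled. Note that `pd.cut` closes intervals on the right by default, while `np.digitize` closes them on the left by default.

```python
import numpy as np
import pandas as pd

x = np.array([0.2, 9.0, 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 10.0])

s = pd.Series(x)

# Label each value with its interval: (0, 1], (1, 2.5] or (2.5, 10].
labels = pd.cut(s, bins)

# transform returns a Series aligned with s, holding each value's bin median.
per_element_median = s.groupby(labels, observed=True).transform("median")
print(per_element_median.tolist())
```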
5
You can speed up shx2's code by computing the statistics for each bin_idx only once.
import numpy as np

x = np.tile(np.array([0.2, 9., 6.4, 3.0, 1.6]), 100000)
bins = np.array([0.0, 1.0, 2.5, 10.0])

def binstats(x, bins):
    inds = np.digitize(x, bins)
    statistics = []
    binnumber = []
    seen = set()  # bin indices whose statistics have already been computed
    for bin_idx in inds:
        if bin_idx not in seen:
            bin_arr = x[inds == bin_idx]
            statistics.append([np.mean(bin_arr), np.median(bin_arr)])
            binnumber.append(bin_idx)
            seen.add(bin_idx)
    return statistics, binnumber

statistics, binnumber = binstats(x, bins)
for (mean, median), bin_idx in zip(statistics, binnumber):
    print('{b}: {mean:.2f} {median:.2f}'.format(b=bin_idx, mean=mean, median=median))
This yields
1: 0.20 0.20
3: 6.13 6.40
2: 1.60 1.60
By the way, if you have scipy installed, you could also use scipy.stats.binned_statistic, though the performance is no better:
import scipy.stats as stats

# This is a hack to return two statistics with one call to binned_statistic.
# It reduces the precision of the statistics to `float32`.
def onecall():
    statistics, bin_edges, binnumber = stats.binned_statistic(
        x, values=x, bins=bins,
        statistic=lambda grp: (np.array([grp.mean(), np.median(grp)])
                               .astype('float32').view('float64')))
    return statistics.view('float32').reshape(-1, 2)

def twocalls():
    means, bin_edges, binnumber = stats.binned_statistic(
        x, values=x, statistic='mean', bins=bins)
    medians, bin_edges, binnumber = stats.binned_statistic(
        x, values=x, statistic='median', bins=bins)
    return means, medians
In [284]: %timeit binstats(x, bins)
10 loops, best of 3: 85.6 ms per loop
In [285]: %timeit onecall()
10 loops, best of 3: 86.6 ms per loop
In [286]: %timeit twocalls()
10 loops, best of 3: 150 ms per loop
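The `binnumber` array returned by `binned_statistic` can also be used to map the per-bin statistics back onto the original values. A minimal sketch, assuming every value lies inside the bin range (so `binnumber` runs from 1 to the number of bins) and using a small array for readability:

```python
import numpy as np
from scipy import stats

x = np.array([0.2, 9.0, 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 10.0])

medians, bin_edges, binnumber = stats.binned_statistic(
    x, values=x, statistic='median', bins=bins)

# binnumber is 1-based for in-range values, so shift by one to index medians.
per_element_median = medians[binnumber - 1]
print(per_element_median)
```

Values outside the bin range get a `binnumber` of 0 or `len(bins)`, so they would need to be masked out before indexing this way.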
1
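I don't have a loop-free solution (the kind most numpy questions call for), but assuming you don't have many bins and the array isn't too large, this should be reasonably fast: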
import numpy as np

x = np.array([0.2, 9., 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 10.0])
inds = np.digitize(x, bins)
inds
=> array([1, 3, 3, 3, 2])

# Print the bin index, bin mean and bin median for every element of x.
for bin_idx in inds:
    bin_arr = x[inds == bin_idx]
    print(bin_idx, np.mean(bin_arr), np.median(bin_arr))
=>
1 0.2 0.2
3 6.13333333333 6.4
3 6.13333333333 6.4
3 6.13333333333 6.4
2 1.6 1.6
To build an array holding the mean of each element's bin:
bin_means = np.array([ x[inds==bin_idx].mean() for bin_idx in inds ])
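For the mean specifically, a loop-free variant is possible with `np.bincount`, which can sum weights per bin index. This is a sketch under the assumption that only means are needed (medians cannot be computed this way); the names `n_slots`, `counts` and `sums` are illustrative:

```python
import numpy as np

x = np.array([0.2, 9.0, 6.4, 3.0, 1.6])
bins = np.array([0.0, 1.0, 2.5, 10.0])
inds = np.digitize(x, bins)

# np.digitize returns indices from 0 to len(bins), so reserve len(bins) + 1 slots.
n_slots = len(bins) + 1
counts = np.bincount(inds, minlength=n_slots)           # number of values per bin
sums = np.bincount(inds, weights=x, minlength=n_slots)  # sum of values per bin

# Empty bins give 0 / 0, which becomes NaN; silence the warning for that case.
with np.errstate(invalid='ignore', divide='ignore'):
    means = sums / counts

# Look up each element's bin mean by its bin index.
bin_means = means[inds]
print(bin_means)
```

Because every occupied bin has a nonzero count, the NaN entries only appear for empty bins and never reach `bin_means`.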