Using numpy percentile on binned data

Asked 2025-04-17 23:50

Suppose house-sale data for a town is reported as counts per price range:

< $100,000              204
$100,000 - $199,999    1651
$200,000 - $299,999    2405
$300,000 - $399,999    1972
$400,000 - $500,000     872
> $500,000             1455

I want to know which price range a given percentile falls in. Is there a way to do this with numpy's percentile function? I can compute it by hand:

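import numpy as np
a = np.array([204., 1651., 2405., 1972., 872., 1455.])  # sales count per bin
b = np.cumsum(a)/np.sum(a) * 100                         # cumulative percentage per bin
q = 75
len(b[b <= q])                                           # bins that end at or below q
4       # i.e. 0-based index 4, the $400,000 - $500,000 bin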

But is there a way to do it with np.percentile?


1 Answer

You're almost there:

cs = np.cumsum(a)
bin_idx = np.searchsorted(cs, np.percentile(cs, 75))
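One caveat worth keeping in mind: np.percentile(cs, 75) interpolates among the six values of cs as if they were individual samples, which is not the same quantity as 75% of the total count. The two thresholds happen to pick the same bin for this data, but they can disagree for other counts. Below is a minimal sketch that takes the threshold directly from the total, assuming 0-based bin indices; bin_for_percentile, labels and skewed are illustrative names, not part of the question.

```python
import numpy as np

labels = ["< $100,000", "$100,000 - $199,999", "$200,000 - $299,999",
          "$300,000 - $399,999", "$400,000 - $500,000", "> $500,000"]
a = np.array([204., 1651., 2405., 1972., 872., 1455.])

def bin_for_percentile(counts, q):
    """Return the 0-based index of the bin containing the q-th percentile."""
    cs = np.cumsum(counts)
    threshold = q / 100.0 * cs[-1]          # q percent of the total count
    # side="right" counts the bins whose cumulative count is <= threshold,
    # mirroring len(b[b <= q]) from the question
    idx = int(np.searchsorted(cs, threshold, side="right"))
    return min(idx, len(cs) - 1)            # keep q = 100 inside the last bin

idx = bin_for_percentile(a, 75)
print(idx, labels[idx])                     # 4 $400,000 - $500,000

# Hypothetical counts where the two thresholds disagree:
skewed = np.array([97., 1., 1., 1.])
cs = np.cumsum(skewed)
print(int(np.searchsorted(cs, np.percentile(cs, 75))))  # 3
print(bin_for_percentile(skewed, 75))                    # 0
```

In the skewed example 97% of the observations sit in the first bin, so the 75th percentile belongs to bin 0; the percentile-of-cumsum version reports bin 3.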

At least in this case (and in a few others involving larger a arrays), the np.percentile version is not any faster:

In [9]: %%timeit
   ...: b = np.cumsum(a)/np.sum(a) * 100
   ...: len(b[b <= 75])
   ...:
10000 loops, best of 3: 38.6 µs per loop

In [10]: %%timeit
   ....: cs = np.cumsum(a)
   ....: np.searchsorted(cs, np.percentile(cs, 75))
   ....:
10000 loops, best of 3: 125 µs per loop
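Outside IPython, a comparable measurement can be sketched with the standard timeit module. The function names manual and with_percentile are illustrative, and the absolute numbers will differ by machine and numpy version, so only the relative order is of interest.

```python
import timeit
import numpy as np

a = np.array([204., 1651., 2405., 1972., 872., 1455.])

def manual(counts, q=75):
    # the question's approach: cumulative percentages, then count bins <= q
    b = np.cumsum(counts) / np.sum(counts) * 100
    return len(b[b <= q])

def with_percentile(counts, q=75):
    # the answer's approach: np.percentile on the cumulative sum
    cs = np.cumsum(counts)
    return int(np.searchsorted(cs, np.percentile(cs, q)))

n = 2000  # repetitions per measurement
t_manual = timeit.timeit(lambda: manual(a), number=n) / n
t_pct = timeit.timeit(lambda: with_percentile(a), number=n) / n
print(f"manual:          {t_manual * 1e6:.1f} µs per call")
print(f"with_percentile: {t_pct * 1e6:.1f} µs per call")
```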

So unless you want to check several percentiles, I would suggest sticking with your current approach.
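For the several-percentiles case, np.searchsorted accepts an array of thresholds, so the cumulative sum only needs to be computed once. A sketch on the question's data, with the thresholds taken as fractions of the total count (q/100 * cs[-1]) rather than from np.percentile(cs, q); qs is an illustrative name.

```python
import numpy as np

a = np.array([204., 1651., 2405., 1972., 872., 1455.])
qs = np.array([25, 50, 75, 90])       # percentiles to look up

cs = np.cumsum(a)                     # cumulative counts, computed once
thresholds = qs / 100.0 * cs[-1]      # each percentile as a count threshold
# side="right" counts bins whose cumulative count is <= the threshold,
# which gives the 0-based index of the bin containing each percentile
bin_idx = np.searchsorted(cs, thresholds, side="right")
print(bin_idx)                        # [2 3 4 5]
```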

Answered 2025-04-17 by Python大师
