Numpy/pandas optimization: bin counting

import numpy as np
import pandas as pd

bins = pd.DataFrame({'from': np.arange(0, 1, 0.01),
                     'to': np.arange(0, 1, 0.01) + 0.1})
x = np.random.rand(1000000)
bins['N'] = bins.apply(lambda r: ((x >= r['from']) & (x < r['to'])).sum(), axis=1)

2 Answers

User

Answer 1 · edited 2024-06-07 13:09:54

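The fact that the intervals all have the same width can be exploited to speed up the code considerably. After setting up a few parameters, the counts can be computed as follows: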

# First, filter out elements that fall outside the [min_val, max_val) range.
# Then subtract min_val from the filtered elements so that they all start from 0.
# Then scale them by width and truncate them, converting them into bin IDs.
IDs = ((x[(x >= min_val) & (x < max_val)] - min_val) / width).astype(int)

# Finally, count those IDs, which gives the desired output as a new column.
# minlength keeps the result aligned with bins even if the last bins are empty.
bins['N'] = np.bincount(IDs, minlength=len(bins))

Thus, for the posted sample, we would have the parameters set as:

^{pr2}$

Sample run:

In [156]: # Params
     ...: min_val = 4
     ...: max_val = 8
     ...: width = 0.4
     ...: 
     ...: # Create inputs
     ...: bins = pd.DataFrame({'from': np.arange(4, 8, 0.4), 'to': 
     ...:                                   np.arange(4, 8, 0.4) + 0.4})
     ...: x = 10*np.random.rand(1000)
     ...: 

In [157]: bins['N'] = bins.apply(lambda r:  ((x >= r['from']) & \
     ...:                                      (x < r['to'])).sum(), axis=1)

In [158]: bins
Out[158]: 
   from   to   N
0   4.0  4.4  42
1   4.4  4.8  40
2   4.8  5.2  36
3   5.2  5.6  43
4   5.6  6.0  45
5   6.0  6.4  29
6   6.4  6.8  40
7   6.8  7.2  46
8   7.2  7.6  41
9   7.6  8.0  45

In [159]: IDs = ((x[(x >= min_val) & (x<=max_val)]-min_val)/width).astype(int)

In [160]: np.bincount(IDs)
Out[160]: array([42, 40, 36, 43, 45, 29, 40, 46, 41, 45])
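
The steps above can also be wrapped in a small helper. This is only a minimal sketch: the function name `count_regular_bins`, its signature, and the demo values are assumptions made for illustration and do not come from the answer. It uses half-open bins to match the `x < r['to']` test in the question, and it pads empty trailing bins with zeros.

```python
import numpy as np
import pandas as pd


def count_regular_bins(x, min_val, width, n_bins):
    """Count values of x in n_bins half-open bins of equal width.

    Bin i covers [min_val + i*width, min_val + (i+1)*width).
    Name and signature are illustrative, not taken from the answer.
    """
    x = np.asarray(x, dtype=float)
    # Map every value to a bin ID; floor keeps values below min_val negative.
    ids = np.floor((x - min_val) / width).astype(int)
    # Drop IDs that do not refer to an existing bin.
    ids = ids[(ids >= 0) & (ids < n_bins)]
    # minlength pads trailing empty bins with zeros.
    return np.bincount(ids, minlength=n_bins)


# Hand-picked values, kept away from bin edges to avoid rounding surprises.
x_demo = np.array([3.9, 4.1, 4.5, 4.6, 7.9, 8.5])
bins_demo = pd.DataFrame({'from': 4 + 0.4 * np.arange(10)})
bins_demo['to'] = bins_demo['from'] + 0.4
bins_demo['N'] = count_regular_bins(x_demo, min_val=4, width=0.4,
                                    n_bins=len(bins_demo))
```

Values that sit within floating-point rounding of a bin edge may land in a neighbouring bin compared with the edge-by-edge comparison, so results can differ slightly from the `apply` version for such values.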

User

Answer 2 · edited 2024-06-07 13:09:54

If "... the boundaries have a fixed width, like [[min + 0*width, min + 1*width], [min + 1*width, min + 2*width], ..., [max - 1*width, max]] ...", then use numpy.histogram:

bins["N"] = np.histogram(x, np.concatenate([bins["from"], bins["to"].tail(1)]))[0]

It could be simpler than this, but because the last edge is stored in bins["to"], it has to be included in the list of bin edges.

For details, see: http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
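
As a small illustration of how the edge list is built, here is a sketch on made-up data. The names `bins_h`, `x_h`, and `edges` are assumptions for this example only. It also shows one behavior worth knowing: `np.histogram` treats the last bin as closed on the right, so a value equal to the final edge is counted, unlike the `x < r['to']` test in the question.

```python
import numpy as np
import pandas as pd

# Three contiguous bins: [0, 1), [1, 2), and [2, 3] (the last one is closed).
bins_h = pd.DataFrame({'from': [0.0, 1.0, 2.0], 'to': [1.0, 2.0, 3.0]})
x_h = np.array([0.5, 1.0, 1.5, 2.5, 3.0, 3.5])

# n bins need n + 1 edges: every left edge plus the final right edge.
edges = np.append(bins_h['from'].to_numpy(), bins_h['to'].iloc[-1])

counts, _ = np.histogram(x_h, bins=edges)
bins_h['N'] = counts
# 3.0 falls into the last bin; 3.5 lies outside all edges and is ignored.
```

This only works when the bins are contiguous, that is, when each bin's right edge equals the next bin's left edge.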
