Vectorized numpy algorithm for summing values that share a timestamp

3 votes

3 answers

874 views

数据工程师

Asked 2025-04-17 05:36

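I have two arrays, P and T. P[i] is a number and T[i] is its timestamp; timestamps may repeat.

I want to produce two more arrays, Q and U, where U[i] is a unique timestamp and Q[i] is the sum of all elements of P whose timestamp is U[i].

For example, for: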

P = [1, 2, 3, 4, 5]
T = [0, 0, 1, 1, 1]

I would get:

Q = [3, 12]
U = [0, 1]

Is there a fast way to do this in numpy, ideally a vectorized one?


3 Answers

>>> P = [1, 2, 3, 4, 5]; T = [0, 0, 1, 1, 1]
>>> U = sorted(set(T))  # a set has no guaranteed order, so sort explicitly
>>> Q = [sum(p for (p, t) in zip(P, T) if t == u) for u in U]
>>> print(Q, U)
[3, 12] [0, 1]


Answered 2025-04-17 by Python大师


import numpy as np
P = np.array([1, 2, 3, 4, 5]) 
T = np.array([0, 0, 1, 1, 1])

U = np.unique(T)
Q = np.array([P[T == u].sum() for u in U])

gives

In [17]: print(Q, U)
[ 3 12] [0 1]

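This is not fully vectorized, because it still loops over the unique timestamps in Python, but it is faster than the list-based solution.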
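For completeness, here is a minimal sketch of one way to drop the Python-level loop entirely: sort by timestamp so that equal timestamps form contiguous runs, then sum each run with `np.add.reduceat`. The names `order`, `T_sorted`, `P_sorted` and `starts` are only illustrative, and this has not been benchmarked against the other approaches here.

```python
import numpy as np

P = np.array([1, 2, 3, 4, 5])
T = np.array([0, 0, 1, 1, 1])

# Sort by timestamp so equal timestamps become contiguous runs.
order = np.argsort(T, kind="stable")
T_sorted = T[order]
P_sorted = P[order]

# return_index gives the first position of each unique value,
# which is the start of each run in the sorted array.
U, starts = np.unique(T_sorted, return_index=True)

# Sum every run [starts[i], starts[i+1]) in a single call.
Q = np.add.reduceat(P_sorted, starts)

print(Q, U)
```

Unlike `np.bincount` with `weights`, this keeps the dtype of P, so integer inputs give integer sums.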

If you need more powerful grouping functionality of this kind, take a look at pandas.
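As a rough sketch of what the pandas route could look like (assuming pandas is installed; the variable name `sums` is just for illustration), a group-by on the timestamp array does the same aggregation:

```python
import numpy as np
import pandas as pd

P = np.array([1, 2, 3, 4, 5])
T = np.array([0, 0, 1, 1, 1])

# Group the values of P by the matching entries of T and sum each group.
# groupby sorts the group keys by default, so U comes out in ascending order.
sums = pd.Series(P).groupby(T).sum()

U = sums.index.to_numpy()
Q = sums.to_numpy()

print(Q, U)
```

Whether this is faster than the pure numpy versions depends on the data size, so it is worth timing on your own data.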

Answered 2025-04-17 by Python大师


With numpy 1.4 or later:

import numpy as np

P = np.array([1, 2, 3, 4, 5]) 
T = np.array([0, 0, 1, 1, 1])

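# inverse[i] is the index into U of T[i], i.e. the group that P[i] belongs to
U, inverse = np.unique(T, return_inverse=True)
# bincount adds up the weights (P) that fall into each group index
Q = np.bincount(inverse, weights=P)
print(Q, U)
# [ 3. 12.] [0 1]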
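Note that `np.bincount` with `weights` returns floating-point sums, as the output above shows. If the sums should keep the dtype of P, one alternative (a sketch, not part of the original answer) is to accumulate into a preallocated array with `np.add.at`, reusing the same `inverse` indices:

```python
import numpy as np

P = np.array([1, 2, 3, 4, 5])
T = np.array([0, 0, 1, 1, 1])

U, inverse = np.unique(T, return_inverse=True)

# Preallocate one slot per unique timestamp, with the same dtype as P.
Q = np.zeros(len(U), dtype=P.dtype)

# Unbuffered in-place add: repeated indices in `inverse` accumulate
# instead of overwriting each other.
np.add.at(Q, inverse, P)

print(Q, U)
```

`np.add.at` may be slower than `np.bincount` on large arrays, so it is mainly worth considering when exact integer results matter.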

Note that this is not the fastest solution. Here is how I tested the speed:

import numpy as np

N = 1000
P = np.repeat(np.array([1, 2, 3, 4, 5]),N)
T = np.repeat(np.array([0, 0, 1, 1, 1]),N)

def using_bincount():
    # Map each timestamp to its group index, then sum P per group.
    U, inverse = np.unique(T, return_inverse=True)
    Q = np.bincount(inverse, weights=P)
    return Q, U

def using_lc():
    U = list(set(T))
    Q = [sum([p for (p,t) in zip(P,T) if t == u]) for u in U]
    return Q,U

def using_slice():
    U = np.unique(T)
    Q = np.array([P[T == u].sum() for u in U])
    return Q,U

For small arrays (N=1), wim's solution (using_lc) is faster:

% python -mtimeit -s'import test' 'test.using_lc()'
100000 loops, best of 3: 18.4 usec per loop
% python -mtimeit -s'import test' 'test.using_slice()'
10000 loops, best of 3: 66.8 usec per loop
% python -mtimeit -s'import test' 'test.using_bincount()'
10000 loops, best of 3: 52.8 usec per loop

For large arrays (N=1000), joris's solution (using_slice) is faster:

% python -mtimeit -s'import test' 'test.using_lc()'
100 loops, best of 3: 9.93 msec per loop
% python -mtimeit -s'import test' 'test.using_slice()'
1000 loops, best of 3: 390 usec per loop
% python -mtimeit -s'import test' 'test.using_bincount()'
1000 loops, best of 3: 846 usec per loop

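I doubt it matters in this case, but benchmark results can vary with the version of numpy, python, the OS, or the hardware. It would not hurt to repeat these benchmarks on your own machine.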
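If you would rather rerun the comparison from a single script than from the shell, a small harness built on the standard `timeit` module might look like the sketch below. The helper `best_usec` and its `repeat`/`number` defaults are arbitrary choices for illustration, not part of the original benchmark, and the numbers it prints will differ from the ones above.

```python
import timeit

import numpy as np

N = 1000
P = np.repeat(np.array([1, 2, 3, 4, 5]), N)
T = np.repeat(np.array([0, 0, 1, 1, 1]), N)

def using_bincount():
    U, inverse = np.unique(T, return_inverse=True)
    return np.bincount(inverse, weights=P), U

def using_slice():
    U = np.unique(T)
    return np.array([P[T == u].sum() for u in U]), U

def best_usec(func, repeat=5, number=100):
    # Best of `repeat` runs, reported as average microseconds per call.
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number * 1e6

for func in (using_slice, using_bincount):
    print(f"{func.__name__}: {best_usec(func):.1f} usec per call")
```

Changing N at the top is enough to see how the ranking shifts between small and large inputs.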

Answered 2025-04-17 by Python大师

