Pandas: rolling computation over a sliding window (unevenly spaced)

13 votes

4 answers

8646 views

Asked 2025-04-17 14:20

Say you have some irregularly spaced time series data:

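import pandas as pd
import random as randy
# 1000 distinct microsecond offsets drawn from the first second after 09:00
offsets = randy.sample(range(10**6), 1000)
index = pd.Timestamp('2013-02-01 09:00:00') + pd.to_timedelta(offsets, unit='us')
ts = pd.Series(range(1000), index=index).sort_index()
print(ts.head())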


2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253

Say I want a rolling sum over a 1 ms window, giving this:

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253

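Currently I cast everything to longs and do the work in Cython, but is this possible in pure pandas? I know you can call something like .asfreq('U'), fill the gaps, and then use the traditional rolling functions, but that does not scale once you have more than a toy number of rows.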
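To make the densify-then-roll idea above concrete, here is a minimal sketch on the five observations printed earlier. It assumes the timestamps fall exactly on a microsecond grid and that filling the gaps with 0 is acceptable for a sum; the names dense, dense_sum, and result are made up for illustration. Note how five observations already expand to thousands of rows, which is why the approach breaks down on real data.

```python
import pandas as pd

# The five observations shown in the question.
ts = pd.Series(
    [995, 499, 797, 295, 253],
    index=pd.to_datetime([
        "2013-02-01 09:00:00.002895",
        "2013-02-01 09:00:00.003765",
        "2013-02-01 09:00:00.003838",
        "2013-02-01 09:00:00.004727",
        "2013-02-01 09:00:00.006287",
    ]),
)

# Densify to one row per microsecond; instants with no observation get 0.
dense = ts.asfreq("us", fill_value=0)

# On the dense grid, a 1 ms window is a fixed window of 1000 rows.
dense_sum = dense.rolling(window=1000, min_periods=1).sum()

# Keep only the rows that correspond to real observations.
result = dense_sum.reindex(ts.index)
print(len(dense), "dense rows for", len(ts), "observations")
print(result)
```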

For reference, here is a hacky, not particularly fast Cython version:

%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[np.int64_t, ndim=1] times, np.ndarray[double, ndim=1] to_add, np.int64_t window_size):
    # times: sorted int64 timestamps; window_size is expressed in the same units
    cdef Py_ssize_t t_len = times.shape[0], s_len = to_add.shape[0], i, j
    cdef np.int64_t window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert t_len == s_len
    for i in range(t_len):
        window_start = times[i] - window_size
        j = i
        # Test j >= 0 first so times[j] is never read with j == -1
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res
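The loop above walks backwards through the whole window for every row. As a rough sketch of an alternative, the same inclusive window [t - window_size, t] can be maintained with a running total and a moving left pointer, so each element is added and removed once. This is a plain Python/NumPy illustration, not the asker's code; the function name rolling_sum_two_pointer is hypothetical, it assumes times is sorted and window_size is non-negative, and the running total can accumulate floating-point error on long inputs.

```python
import numpy as np

def rolling_sum_two_pointer(times, to_add, window_size):
    """Sum to_add[j] over all j <= i with times[j] >= times[i] - window_size."""
    times = np.asarray(times, dtype=np.int64)
    to_add = np.asarray(to_add, dtype=np.float64)
    res = np.zeros(len(times), dtype=np.float64)
    left = 0      # index of the oldest element still inside the window
    total = 0.0   # running sum of to_add[left..i]
    for i in range(len(times)):
        total += to_add[i]
        # Drop elements that fell out of the window; left never passes i
        # because times[i] itself always satisfies the window condition.
        while times[left] < times[i] - window_size:
            total -= to_add[left]
            left += 1
        res[i] = total
    return res

# The question's five observations, as microsecond offsets, with a 1000 us window.
example = rolling_sum_two_pointer(
    [2895, 3765, 3838, 4727, 6287], [995, 499, 797, 295, 253], 1000
)
print(example)
```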

Demonstrating this on a slightly larger series:

offsets = randy.sample(range(10**8), 100000)
ts = pd.Series(range(100000), index=pd.Timestamp('2013-02-01 09:00:00') + pd.to_timedelta(offsets, unit='us')).sort_index()

%%timeit
# assuming nanosecond-resolution timestamps, 10**6 is a 1 ms window
res2 = rolling_sum_cython(ts.index.astype('int64').to_numpy(), ts.values.astype('float64'), 10**6)
1000 loops, best of 3: 1.56 ms per loop


4 Answers

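Perhaps rolling_sum would be a better fit (note that the top-level pd.rolling_* functions were deprecated in pandas 0.18 and have since been removed):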

pd.rolling_sum(ts, window=1, freq='1ms')

Answered 2025-04-17 by Python大师


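This is an old question, but for anyone landing here from Google: as of pandas 0.19 this is built in.

See the documentation for details: http://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling

To get 1 ms windows, create a Rolling object like this (dft is the example frame from the linked documentation; with the series from the question, use ts instead):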

dft.rolling('1ms')

You can then compute its sum:

dft.rolling('1ms').sum()
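As a small check against the question, here is a sketch applying the time-aware window to the five observations printed there. For offset windows such as '1ms', pandas defaults to closed='right', so each window covers (t - 1 ms, t]; passing closed='both' would also include an observation exactly 1 ms old, which is what the asker's Cython loop does. On this data the two choices give the same result.

```python
import pandas as pd

# The five observations shown in the question.
ts = pd.Series(
    [995, 499, 797, 295, 253],
    index=pd.to_datetime([
        "2013-02-01 09:00:00.002895",
        "2013-02-01 09:00:00.003765",
        "2013-02-01 09:00:00.003838",
        "2013-02-01 09:00:00.004727",
        "2013-02-01 09:00:00.006287",
    ]),
)

# Time-aware rolling sum: the index must be a sorted (monotonic) DatetimeIndex.
result = ts.rolling("1ms").sum()

# Same window, but including the left edge as well.
result_both = ts.rolling("1ms", closed="both").sum()
print(result)
```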

Answered 2025-04-17 by Python大师


You can solve most problems of this sort with a cumulative sum (cumsum) and a binary search.

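from datetime import timedelta

import numpy as np
import pandas as pd

def msum(s, lag_in_ms):
    # Window start for each row: its own timestamp minus the lag
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    # Position of the first observation at or after each window start
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    # Sum over positions inds..i is cs[i] - cs[inds] + s[inds]; index by position, not label
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)

res = msum(ts, 100)
print(pd.DataFrame({'a': ts, 'a_msum_100': res}))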


                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]

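You need a way of handling NaNs, and depending on your application you may or may not want the prevailing value as of the lagged time (that is the difference between using kdb+ bin and np.searchsorted).

Hope this helps.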
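One possible reading of those two caveats, as a hedged sketch: NaNs are treated as 0 before the cumulative sum, and the "prevailing value" lookup is approximated with np.searchsorted(..., side='right') - 1, which returns the last observation at or before the lagged time (or -1 when there is none). The helper names msum_nan_safe and prevailing_index are hypothetical, and treating NaN as 0 is an assumption that may not suit every application.

```python
import numpy as np
import pandas as pd

def msum_nan_safe(s, lag):
    """Trailing sum over [t - lag, t], counting NaN observations as 0."""
    times = s.index.to_numpy()
    lagged = (s.index - lag).to_numpy()
    # First observation at or after each window start.
    starts = np.searchsorted(times, lagged, side="left")
    vals = s.fillna(0).to_numpy(dtype="float64")
    cs = np.cumsum(vals)
    return pd.Series(cs - cs[starts] + vals[starts], index=s.index)

def prevailing_index(s, lag):
    """Position of the last observation at or before t - lag, or -1 if none."""
    times = s.index.to_numpy()
    lagged = (s.index - lag).to_numpy()
    return np.searchsorted(times, lagged, side="right") - 1

# Made-up data: offsets in milliseconds from 09:00, with one missing value.
index = pd.Timestamp("2013-02-01 09:00:00") + pd.to_timedelta([0, 40, 90, 150, 160], unit="ms")
s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0], index=index)

sums = msum_nan_safe(s, pd.Timedelta(milliseconds=100))
prev = prevailing_index(s, pd.Timedelta(milliseconds=100))
print(sums)
print(prev)
```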

Answered 2025-04-17 by Python大师
