使用线性插值对不规则时间序列进行正则化

30 投票

4 回答

24363 浏览

提问于 2025-04-18 16:47

我在pandas中有一个时间序列，长得像这样：

                     Values
1992-08-27 07:46:48    28.0  
1992-08-27 08:00:48    28.2  
1992-08-27 08:33:48    28.4  
1992-08-27 08:43:48    28.8  
1992-08-27 08:48:48    29.0  
1992-08-27 08:51:48    29.2  
1992-08-27 08:53:48    29.6  
1992-08-27 08:56:48    29.8  
1992-08-27 09:03:48    30.0

我想把它重新采样成一个规律的时间序列，时间间隔是15分钟，并且希望值是线性插值的。简单来说，我想得到的是：

                     Values
1992-08-27 08:00:00    28.2  
1992-08-27 08:15:00    28.3  
1992-08-27 08:30:00    28.4  
1992-08-27 08:45:00    28.8  
1992-08-27 09:00:00    29.9

但是，当我使用pandas的重采样方法（df.resample('15Min')）时，我得到的是：

                     Values
1992-08-27 08:00:00   28.20  
1992-08-27 08:15:00     NaN  
1992-08-27 08:30:00   28.60  
1992-08-27 08:45:00   29.40  
1992-08-27 09:00:00   30.00

我尝试了用不同的how和fill_method参数来重采样，但始终没有得到我想要的结果。我是不是用错方法了？

pandas 时间序列线性插值重采样数据正则化

4 个回答

我最近需要对一些不均匀采样的加速度数据进行重采样。这些数据通常是在正确的频率下采集的，但有时会出现延迟，导致数据逐渐积累。

我找到了一个相关的问题，并结合了mstringer和Alberto Garcia-Rabosco的回答，使用了纯粹的pandas和numpy。这种方法会在你想要的频率上创建一个新的索引，然后进行插值，而不需要先在更高的频率上插值的那一步。

# from Alberto Garcia-Rabosco above
import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
s = pd.read_csv(data, squeeze=True)
s.index = pd.to_datetime(s.index)

进行插值的代码：

import numpy as np
# create the new index and a new series full of NaNs
new_index = pd.DatetimeIndex(start='1992-08-27 08:00:00', 
    freq='15 min', periods=5, yearfirst=True)
new_series = pd.Series(np.nan, index=new_index)

# concat the old and new series and remove duplicates (if any) 
comb_series = pd.concat([s, new_series])
comb_series = comb_series[~comb_series.index.duplicated(keep='first')]

# interpolate to fill the NaNs
comb_series.interpolate(method='time', inplace=True)

输出结果：

>>> print(comb_series[new_index])
1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, dtype: float64

和之前一样，你可以使用scipy支持的任何插值方法，这个技巧也适用于DataFrame（我最开始就是用这个的）。最后要注意的是，插值默认使用的是“线性”方法，这种方法会忽略索引中的时间信息，因此不适用于不均匀间隔的数据。

回答于 2025-04-18 由 Python大师

分享举报

和@mstringer得到的结果一样，我们可以完全通过pandas来实现。诀窍是先按秒进行重采样，然后用插值法填补中间的值（.resample('s').interpolate()），接着再按15分钟的时间段进行上采样（.resample('15T').asfreq()）。

import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)

res = s.resample('s').interpolate().resample('15T').asfreq().dropna()
print(res)

输出结果：

1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, Name: Values, dtype: float64

回答于 2025-04-18 由 Python大师

分享举报

你可以使用traces来实现这个功能。首先，像创建字典一样，用你的不规则测量数据创建一个TimeSeries：

ts = traces.TimeSeries([
    (datetime(1992, 8, 27, 7, 46, 48), 28.0),
    (datetime(1992, 8, 27, 8, 0, 48), 28.2),
    ...
    (datetime(1992, 8, 27, 9, 3, 48), 30.0),
])

然后使用sample方法进行规范化：

ts.sample(
    sampling_period=timedelta(minutes=15),
    start=datetime(1992, 8, 27, 8),
    end=datetime(1992, 8, 27, 9),
    interpolate='linear',
)

这样就得到了下面这个规范化的版本，灰色的点是原始数据，而橙色的点是通过线性插值得到的规范化版本。

插值后的值是：

1992-08-27 08:00:00    28.189 
1992-08-27 08:15:00    28.286  
1992-08-27 08:30:00    28.377
1992-08-27 08:45:00    28.848
1992-08-27 09:00:00    29.891

回答于 2025-04-18 由 Python大师

分享举报

这需要一些努力，但可以试试这个方法。基本的思路是找到每个重新采样点最近的两个时间戳，然后进行插值计算。这里用到的 np.searchsorted 是用来找到离重新采样点最近的日期。

# empty frame with desired index
rs = pd.DataFrame(index=df.resample('15min').iloc[1:].index)

# array of indexes corresponding with closest timestamp after resample
idx_after = np.searchsorted(df.index.values, rs.index.values)

# values and timestamp before/after resample
rs['after'] = df.loc[df.index[idx_after], 'Values'].values
rs['before'] = df.loc[df.index[idx_after - 1], 'Values'].values
rs['after_time'] = df.index[idx_after]
rs['before_time'] = df.index[idx_after - 1]

#calculate new weighted value
rs['span'] = (rs['after_time'] - rs['before_time'])
rs['after_weight'] = (rs['after_time'] - rs.index) / rs['span']
# I got errors here unless I turn the index to a series
rs['before_weight'] = (pd.Series(data=rs.index, index=rs.index) - rs['before_time']) / rs['span']

rs['Values'] = rs.eval('before * before_weight + after * after_weight')

经过这些步骤，希望能得到正确的答案：

In [161]: rs['Values']
Out[161]: 
1992-08-27 08:00:00    28.011429
1992-08-27 08:15:00    28.313939
1992-08-27 08:30:00    28.223030
1992-08-27 08:45:00    28.952000
1992-08-27 09:00:00    29.908571
Freq: 15T, Name: Values, dtype: float64

回答于 2025-04-18 由 Python大师

分享举报

使用线性插值对不规则时间序列进行正则化

4 个回答

撰写回答