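pandas.DataFrame.interpolate() does not interpolate or extrapolate time series data correctly
I'm having trouble with the pandas interpolation method pandas.DataFrame.interpolate(). I have time series data whose points are roughly 2 minutes apart, and I want to resample it to exactly one point every 2 minutes so that I can later synchronize it with other data. The problem is that the interpolated temperature and humidity values look wrong. I tried interpolate(method='time'), interpolate(method='linear') and interpolate(method='index'), but the results are all about the same. What am I misunderstanding or doing wrong with this pandas method?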
import pandas as pd
import numpy as np
# Generating random data
np.random.seed(0)
num_rows = 20
data = {
'temperature': np.random.randint(20, 30, num_rows),
'humidity': np.random.randint(40, 60, num_rows)
}
print(data)
# Generating random time indices
# Generating random time offsets for each row
time_offsets = np.random.randint(0, 120, num_rows)
time_offsets = pd.to_timedelta(time_offsets, unit='s')
# Generating random start and end times
start_time = pd.Timestamp('2024-02-24 9:55:37')
end_time = pd.Timestamp('2024-02-24 11:00:00')
# Generating time indices for each row
time_indices = [start_time + pd.Timedelta(minutes=2*i) + offset for i, offset in enumerate(time_offsets)]
print(time_indices)
# Creating DataFrame
combined_data = pd.DataFrame(data, index=time_indices)
print("Random DataFrame:")
print(combined_data)
# Resample the data to 2-minute frequency
resampled_data = combined_data.resample('2min').interpolate(method='time')
print("\nResampled DataFrame:")
print(resampled_data)
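Here is the result I get. The interpolated DataFrame repeats the same values across several rows, and the output is nothing like the averages I calculated by hand.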
Random DataFrame:
temperature humidity
2024-02-24 09:56:00 25 45
2024-02-24 09:57:46 20 53
2024-02-24 10:00:34 23 48
2024-02-24 10:02:09 23 49
2024-02-24 10:04:08 27 59
2024-02-24 10:06:51 29 56
2024-02-24 10:09:33 23 59
2024-02-24 10:10:00 25 45
2024-02-24 10:12:12 22 55
2024-02-24 10:14:52 24 55
2024-02-24 10:17:31 27 40
2024-02-24 10:18:32 26 58
2024-02-24 10:20:05 28 43
2024-02-24 10:22:11 28 57
2024-02-24 10:23:37 21 59
2024-02-24 10:25:37 26 59
2024-02-24 10:28:13 27 59
2024-02-24 10:30:30 27 54
2024-02-24 10:31:42 28 47
2024-02-24 10:34:00 21 40
Resampled DataFrame:
temperature humidity
2024-02-24 09:56:00 25.000000 45.000000
2024-02-24 09:58:00 25.000000 45.000000
2024-02-24 10:00:00 25.000000 45.000000
2024-02-24 10:02:00 25.000000 45.000000
2024-02-24 10:04:00 25.000000 45.000000
2024-02-24 10:06:00 25.000000 45.000000
2024-02-24 10:08:00 25.000000 45.000000
2024-02-24 10:10:00 25.000000 45.000000
2024-02-24 10:12:00 24.666667 44.583333
2024-02-24 10:14:00 24.333333 44.166667
2024-02-24 10:16:00 24.000000 43.750000
2024-02-24 10:18:00 23.666667 43.333333
2024-02-24 10:20:00 23.333333 42.916667
2024-02-24 10:22:00 23.000000 42.500000
2024-02-24 10:24:00 22.666667 42.083333
2024-02-24 10:26:00 22.333333 41.666667
2024-02-24 10:28:00 22.000000 41.250000
2024-02-24 10:30:00 21.666667 40.833333
2024-02-24 10:32:00 21.333333 40.416667
2024-02-24 10:34:00 21.000000 40.000000
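Thank you very much!
I tried the different options of the pandas interpolation method. I expected the temperature and humidity values to be interpolated or extrapolated correctly according to their timestamps.
2 Answers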
0
# From OP
resampled_data = combined_data.resample('2min').interpolate(method='time')
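This code upsamples to the 2-minute grid and interpolates only to fill in the missing values; it does not aggregate the readings that fall within each 2-minute bin.
If I understand correctly, you want to resample the data by taking the mean of each bin (or the first or last value?) and then interpolate: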
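The flat run of 25/45 followed by a straight line down to 21/40 in the question's output is consistent with only the rows that sit exactly on the 2-minute grid (09:56:00, 10:10:00, 10:34:00) being used. Below is a minimal sketch of that effect on made-up toy data, with the asfreq() step written out explicitly. Whether resample(...).interpolate() follows exactly this path depends on the pandas version, so that equivalence is an assumption here rather than a documented guarantee.

```python
import pandas as pd

# Hypothetical toy data: only the first and last rows sit exactly on the 2-minute grid.
idx = pd.to_datetime([
    "2024-02-24 10:00:00",
    "2024-02-24 10:01:30",
    "2024-02-24 10:03:10",
    "2024-02-24 10:06:00",
])
df = pd.DataFrame({"temperature": [20.0, 30.0, 40.0, 26.0]}, index=idx)

# asfreq() reindexes onto the 2-minute grid: rows whose timestamps are not
# exactly on the grid (10:01:30 and 10:03:10) do not appear in the result.
on_grid = df.resample("2min").asfreq()

# Interpolating afterwards draws a straight line between the surviving rows,
# so the off-grid readings 30.0 and 40.0 have no influence at all.
two_step = on_grid.interpolate(method="time")

print(on_grid)
print(two_step)
```

In this sketch the readings 30.0 and 40.0 never reach the interpolation step, which mirrors how most of the question's rows appear to be ignored.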
resampled_data = combined_data.resample("2min").mean().interpolate("linear")
Resampled DataFrame:
temperature humidity
2024-02-24 09:56:00 22.50 49.00
2024-02-24 09:58:00 22.75 48.50
2024-02-24 10:00:00 23.00 48.00
2024-02-24 10:02:00 23.00 49.00
2024-02-24 10:04:00 27.00 59.00
2024-02-24 10:06:00 29.00 56.00
2024-02-24 10:08:00 23.00 59.00
2024-02-24 10:10:00 25.00 45.00
2024-02-24 10:12:00 22.00 55.00
2024-02-24 10:14:00 24.00 55.00
2024-02-24 10:16:00 27.00 40.00
2024-02-24 10:18:00 26.00 58.00
2024-02-24 10:20:00 28.00 43.00
2024-02-24 10:22:00 24.50 58.00
2024-02-24 10:24:00 26.00 59.00
2024-02-24 10:26:00 26.50 59.00
2024-02-24 10:28:00 27.00 59.00
2024-02-24 10:30:00 27.50 50.50
2024-02-24 10:32:00 24.25 45.25
2024-02-24 10:34:00 21.00 40.00
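The choice between mean, first and last changes what each 2-minute row represents. The sketch below uses the first three rows of the question's table as toy input to show how the three aggregations differ before the empty 09:58 bin is filled by interpolation; the variable names are illustrative only.

```python
import pandas as pd

# First three rows of the question's table, used here as toy input.
idx = pd.to_datetime([
    "2024-02-24 09:56:00",
    "2024-02-24 09:57:46",
    "2024-02-24 10:00:34",
])
df = pd.DataFrame({"temperature": [25.0, 20.0, 23.0]}, index=idx)

# Bins: [09:56, 09:58) holds two readings, [09:58, 10:00) is empty,
# [10:00, 10:02) holds one reading.
bins = df.resample("2min")

# Aggregate each bin, then fill the empty bin by linear interpolation.
by_mean = bins.mean().interpolate(method="linear")
by_first = bins.first().interpolate(method="linear")
by_last = bins.last().interpolate(method="linear")

print(pd.concat({"mean": by_mean, "first": by_first, "last": by_last}, axis=1))
```

With mean, the 09:56 row summarizes both readings in that bin; with first or last it is a single raw reading relabelled to the bin start.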
0
I would consider resampling the temperature with the mean, like this:
import numpy as np
import pandas as pd
np.random.seed(0)
num_rows = 20
data = {
'temperature': np.random.randint(20, 30, num_rows),
'humidity': np.random.randint(40, 60, num_rows)
}
time_offsets = np.random.randint(0, 120, num_rows)
time_offsets = pd.to_timedelta(time_offsets, unit='s')
start_time = pd.Timestamp('2024-02-24 9:55:37')
time_indices = [start_time + pd.Timedelta(minutes=2*i) + offset for i, offset in enumerate(time_offsets)]
combined_data = pd.DataFrame(data, index=time_indices)
resampled_data = combined_data.resample('2min').mean()
interpolated_data = resampled_data.interpolate(method='time')
combined_data, resampled_data.head(10), interpolated_data.head(10)
This gives you:
( temperature humidity
2024-02-24 09:56:42 25 45
2024-02-24 09:57:46 20 53
2024-02-24 10:00:34 23 48
2024-02-24 10:02:09 23 49
2024-02-24 10:04:08 27 59
2024-02-24 10:06:51 29 56
2024-02-24 10:09:33 23 59
2024-02-24 10:10:00 25 45
2024-02-24 10:12:12 22 55
2024-02-24 10:14:52 24 55
2024-02-24 10:17:31 27 40
2024-02-24 10:18:32 26 58
2024-02-24 10:20:05 28 43
2024-02-24 10:22:11 28 57
2024-02-24 10:23:37 21 59
2024-02-24 10:25:37 26 59
2024-02-24 10:28:13 27 59
2024-02-24 10:30:30 27 54
2024-02-24 10:31:42 28 47
2024-02-24 10:34:15 21 40,
temperature humidity
2024-02-24 09:56:00 22.5 49.0
2024-02-24 09:58:00 NaN NaN
2024-02-24 10:00:00 23.0 48.0
2024-02-24 10:02:00 23.0 49.0
2024-02-24 10:04:00 27.0 59.0
2024-02-24 10:06:00 29.0 56.0
2024-02-24 10:08:00 23.0 59.0
2024-02-24 10:10:00 25.0 45.0
2024-02-24 10:12:00 22.0 55.0
2024-02-24 10:14:00 24.0 55.0,
temperature humidity
2024-02-24 09:56:00 22.50 49.0
2024-02-24 09:58:00 22.75 48.5
2024-02-24 10:00:00 23.00 48.0
2024-02-24 10:02:00 23.00 49.0
2024-02-24 10:04:00 27.00 59.0
2024-02-24 10:06:00 29.00 56.0
2024-02-24 10:08:00 23.00 59.0
2024-02-24 10:10:00 25.00 45.0
2024-02-24 10:12:00 22.00 55.0
2024-02-24 10:14:00 24.00 55.0)
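If the goal is the estimated value at each exact 2-minute mark rather than a per-bin average, one alternative that neither answer uses is to add the grid timestamps to the raw index, interpolate by elapsed time, and then keep only the grid rows. This is a sketch on made-up toy data; the variable names and the choice to restrict the grid to the observed range (so nothing is extrapolated) are assumptions.

```python
import pandas as pd

# Hypothetical irregular readings, none of them exactly on the 2-minute grid.
idx = pd.to_datetime([
    "2024-02-24 10:00:30",
    "2024-02-24 10:01:30",
    "2024-02-24 10:03:30",
    "2024-02-24 10:06:30",
])
df = pd.DataFrame({"temperature": [20.0, 22.0, 26.0, 20.0]}, index=idx)

# Regular 2-minute grid that stays inside the observed time range.
grid = pd.date_range(
    df.index.min().ceil("2min"),
    df.index.max().floor("2min"),
    freq="2min",
)

# Insert the grid timestamps among the raw ones (as NaN rows), interpolate
# weighted by elapsed time, then keep only the grid rows.
merged = df.reindex(df.index.union(grid))
at_grid = merged.interpolate(method="time").loc[grid]

print(at_grid)
```

Here every raw reading contributes to its neighbouring grid points, weighted by how close it is in time, which is closer to what the question describes as interpolating "according to their timestamps".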