pandas.DataFrame.interpolate() does not interpolate or extrapolate time series data correctly

Question

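I'm having trouble with the pandas interpolation method pandas.DataFrame.interpolate(). I have time series data whose points are roughly 2 minutes apart. I want to resample it to exactly one point every 2 minutes so that I can synchronize it with other data later. The problem is that the interpolated temperature and humidity values don't look right. I tried interpolate(method='time'), interpolate(method='linear'), and interpolate(method='index'), but the results are all about the same. What am I misunderstanding or doing wrong with this pandas method?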

import pandas as pd
import numpy as np

# Generating random data
np.random.seed(0)
num_rows = 20
data = {
    'temperature': np.random.randint(20, 30, num_rows),
    'humidity': np.random.randint(40, 60, num_rows)
}
print(data)
# Generating random time indices
# Generating random time offsets for each row
time_offsets = np.random.randint(0, 120, num_rows)
time_offsets = pd.to_timedelta(time_offsets, unit='s')

# Generating random start and end times
start_time = pd.Timestamp('2024-02-24 9:55:37')
end_time = pd.Timestamp('2024-02-24 11:00:00')

# Generating time indices for each row
time_indices = [start_time + pd.Timedelta(minutes=2*i) + offset for i, offset in enumerate(time_offsets)]

print(time_indices)
# Creating DataFrame
combined_data = pd.DataFrame(data, index=time_indices)

print("Random DataFrame:")
print(combined_data)


# Resample the data to 2-minute frequency
resampled_data = combined_data.resample('2min').interpolate(method='time')


print("\nResampled DataFrame:")
print(resampled_data)

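Here is the result I get. The interpolated DataFrame repeats the same values across several rows, and the averaged values it then outputs are nothing like what I calculated by hand.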

Random DataFrame:
                     temperature  humidity
2024-02-24 09:56:00           25        45
2024-02-24 09:57:46           20        53
2024-02-24 10:00:34           23        48
2024-02-24 10:02:09           23        49
2024-02-24 10:04:08           27        59
2024-02-24 10:06:51           29        56
2024-02-24 10:09:33           23        59
2024-02-24 10:10:00           25        45
2024-02-24 10:12:12           22        55
2024-02-24 10:14:52           24        55
2024-02-24 10:17:31           27        40
2024-02-24 10:18:32           26        58
2024-02-24 10:20:05           28        43
2024-02-24 10:22:11           28        57
2024-02-24 10:23:37           21        59
2024-02-24 10:25:37           26        59
2024-02-24 10:28:13           27        59
2024-02-24 10:30:30           27        54
2024-02-24 10:31:42           28        47
2024-02-24 10:34:00           21        40

Resampled DataFrame:
                     temperature   humidity
2024-02-24 09:56:00    25.000000  45.000000
2024-02-24 09:58:00    25.000000  45.000000
2024-02-24 10:00:00    25.000000  45.000000
2024-02-24 10:02:00    25.000000  45.000000
2024-02-24 10:04:00    25.000000  45.000000
2024-02-24 10:06:00    25.000000  45.000000
2024-02-24 10:08:00    25.000000  45.000000
2024-02-24 10:10:00    25.000000  45.000000
2024-02-24 10:12:00    24.666667  44.583333
2024-02-24 10:14:00    24.333333  44.166667
2024-02-24 10:16:00    24.000000  43.750000
2024-02-24 10:18:00    23.666667  43.333333
2024-02-24 10:20:00    23.333333  42.916667
2024-02-24 10:22:00    23.000000  42.500000
2024-02-24 10:24:00    22.666667  42.083333
2024-02-24 10:26:00    22.333333  41.666667
2024-02-24 10:28:00    22.000000  41.250000
2024-02-24 10:30:00    21.666667  40.833333
2024-02-24 10:32:00    21.333333  40.416667
2024-02-24 10:34:00    21.000000  40.000000
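
A note on reading the output above: the only original timestamps that fall exactly on a 2-minute boundary are 09:56:00, 10:10:00 and 10:34:00, and the resampled values match the original data at exactly those three rows, with straight lines in between. That pattern is what you would see if the data were first placed on the regular grid (dropping every off-grid observation) and only then interpolated. The sketch below is one way to check this, using a small stand-in frame taken from the first four rows of the data; the names sample and on_grid are illustrative, and the behaviour of resample(...).interpolate() itself may differ between pandas versions.

```python
import pandas as pd

# First four rows of the question's data (temperature only), as a stand-in.
idx = pd.to_datetime([
    "2024-02-24 09:56:00",  # the only timestamp exactly on the 2-minute grid
    "2024-02-24 09:57:46",
    "2024-02-24 10:00:34",
    "2024-02-24 10:02:09",
])
sample = pd.DataFrame({"temperature": [25.0, 20.0, 23.0, 23.0]}, index=idx)

# asfreq() reindexes onto the regular grid: observations whose timestamps
# are not exactly on a grid point are dropped and show up as NaN.
on_grid = sample.resample("2min").asfreq()
print(on_grid)

# Number of original observations that survive on the grid.
kept = int(on_grid["temperature"].notna().sum())
print("observations kept:", kept)
```

If only a handful of rows survive this step, any interpolation run afterwards can only connect those few points, which would explain the flat and slowly sloping stretches in the resampled output.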

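Thanks a lot!

I tried the different methods offered by pandas' interpolate. I expected the temperature and humidity values to be correctly interpolated or extrapolated based on their timestamps.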
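
For reference, one common way to get values that are interpolated according to the actual timestamps is to add the target grid points to the original index, interpolate with method='time' while the original observations are still present, and then keep only the grid points. The following is a minimal sketch under stated assumptions: it uses the first four rows of the data above as a stand-in, the names df, grid, combined and regular are illustrative, and limit_area='inside' is my choice so that points outside the observed range stay NaN rather than being filled.

```python
import pandas as pd

# Stand-in for combined_data: first four rows of the question's data.
idx = pd.to_datetime([
    "2024-02-24 09:56:00",
    "2024-02-24 09:57:46",
    "2024-02-24 10:00:34",
    "2024-02-24 10:02:09",
])
df = pd.DataFrame(
    {"temperature": [25.0, 20.0, 23.0, 23.0],
     "humidity": [45.0, 53.0, 48.0, 49.0]},
    index=idx,
)

# Regular 2-minute grid covering the observed time range.
grid = pd.date_range(
    df.index.min().floor("2min"),
    df.index.max().ceil("2min"),
    freq="2min",
)

# Add the grid timestamps as NaN rows next to the real observations.
combined = df.reindex(df.index.union(grid))

# Interpolate weighted by elapsed time; limit_area="inside" leaves grid
# points outside the observed range as NaN instead of filling them.
regular = combined.interpolate(method="time", limit_area="inside").reindex(grid)
print(regular)
```

In this sketch the value at 09:58:00 is computed from the neighbouring observations at 09:57:46 and 10:00:34, weighted by the time gaps. Grid points after the last observation stay NaN; filling them would be a separate, explicit extrapolation choice.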

