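Rolling average of a pandas DataFrame based on the times in another DataFrame
I have a DataFrame (df1) with a record roughly every 15 seconds, and another DataFrame (df2) with a record roughly every 5 minutes. Both are shown below. I want to match each "time" value in df2 to the closest timestamp in df1, then compute the mean of the df1 data over a 5-minute window centered on that closest timestamp.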
df1:
time speed
0 2022-10-04 00:00:24 4.590
1 2022-10-04 00:00:41 4.389
2 2022-10-04 00:00:57 4.367
3 2022-10-04 00:01:14 4.539
4 2022-10-04 00:01:30 4.584
5 2022-10-04 00:01:48 4.523
6 2022-10-04 00:02:05 4.498
7 2022-10-04 00:02:21 4.625
8 2022-10-04 00:02:38 4.497
9 2022-10-04 00:02:54 4.406
10 2022-10-04 00:03:12 4.502
11 2022-10-04 00:03:28 4.494
12 2022-10-04 00:03:45 4.445
13 2022-10-04 00:04:01 4.438
14 2022-10-04 00:04:18 4.433
15 2022-10-04 00:04:36 4.441
16 2022-10-04 00:04:52 4.400
17 2022-10-04 00:05:09 4.221
18 2022-10-04 00:05:27 4.115
19 2022-10-04 00:05:43 4.009
20 2022-10-04 00:06:01 4.230
21 2022-10-04 00:06:18 4.360
22 2022-10-04 00:06:34 4.331
23 2022-10-04 00:06:51 4.178
24 2022-10-04 00:07:07 4.238
25 2022-10-04 00:07:25 4.125
26 2022-10-04 00:07:43 3.988
27 2022-10-04 00:08:17 3.573
28 2022-10-04 00:08:34 4.471
29 2022-10-04 00:08:50 4.567
30 2022-10-04 00:09:08 4.451
31 2022-10-04 00:09:25 4.311
32 2022-10-04 00:09:42 4.280
33 2022-10-04 00:09:59 4.439
34 2022-10-04 00:10:17 4.410
35 2022-10-04 00:10:35 4.335
36 2022-10-04 00:10:51 4.193
37 2022-10-04 00:11:08 4.140
38 2022-10-04 00:11:25 4.020
39 2022-10-04 00:11:43 3.872
40 2022-10-04 00:12:01 3.859
41 2022-10-04 00:12:17 4.062
42 2022-10-04 00:12:34 3.861
43 2022-10-04 00:12:51 3.780
44 2022-10-04 00:13:07 3.680
45 2022-10-04 00:13:25 3.909
46 2022-10-04 00:13:42 3.852
47 2022-10-04 00:13:58 3.867
48 2022-10-04 00:14:15 3.715
49 2022-10-04 00:14:32 3.534
50 2022-10-04 00:14:49 3.349
51 2022-10-04 00:15:06 3.213
52 2022-10-04 00:15:23 3.215
53 2022-10-04 00:15:39 3.246
54 2022-10-04 00:15:55 3.195
55 2022-10-04 00:16:14 3.164
56 2022-10-04 00:16:30 3.149
57 2022-10-04 00:16:47 3.281
58 2022-10-04 00:17:03 3.366
59 2022-10-04 00:17:20 3.295
60 2022-10-04 00:17:38 3.487
61 2022-10-04 00:17:54 3.534
62 2022-10-04 00:18:11 3.430
63 2022-10-04 00:18:27 3.474
64 2022-10-04 00:18:44 3.275
65 2022-10-04 00:19:01 3.584
66 2022-10-04 00:19:18 3.616
67 2022-10-04 00:19:34 3.506
68 2022-10-04 00:19:51 3.561
69 2022-10-04 00:20:08 3.316
70 2022-10-04 00:20:27 3.396
71 2022-10-04 00:20:43 3.536
72 2022-10-04 00:20:59 3.631
73 2022-10-04 00:21:16 3.573
74 2022-10-04 00:21:33 3.514
75 2022-10-04 00:21:50 3.603
76 2022-10-04 00:22:07 3.591
77 2022-10-04 00:22:23 3.591
78 2022-10-04 00:22:40 3.659
79 2022-10-04 00:23:14 4.056
df2:
time speed
0 2022-10-03 00:03:23 4.646689
1 2022-10-03 00:08:24 5.328516
2 2022-10-03 00:13:24 5.895778
3 2022-10-03 00:18:24 5.665014
4 2022-10-03 00:22:25 6.313763
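I know I can use pandas.merge_asof to align the times approximately, with the 'tolerance' parameter controlling how large a gap is allowed. Beyond that, I tried several combinations of pandas.groupby and pandas.rolling, but I could not get the result I want. I'm not sure how to proceed, so any help is much appreciated.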
1 Answer
If I understand correctly, you could try something like this:
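import pandas as pd

# Centered 5-minute (300 s) rolling mean of df1["speed"], keyed on "time"
df1_rolling_mean = (
    df1.rolling(window="300s", on="time", center=True)
    .mean()
    .rename(columns={"speed": "speed_avg"})
)
# Attach speed_avg to df1 (the only shared column is "time")
df1 = pd.merge(df1, df1_rolling_mean)

# For each df1 row, take the latest df2 row at or before it, at most 15 s away
df = pd.merge_asof(
    df1,
    df2.rename(columns={"speed": "speed_df2"}),
    on="time",
    tolerance=pd.Timedelta(seconds=15),
)
# Keep only the df1 rows that found a df2 match
df = df[df.loc[:, "speed_df2"].notna()]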
This produces:
time speed speed_avg speed_df2
11 2022-10-04 00:03:28 4.494 4.421765 4.646689
28 2022-10-04 00:08:34 4.471 4.265625 5.328516
45 2022-10-04 00:13:25 3.909 3.687167 5.895778
63 2022-10-04 00:18:27 3.474 3.410000 5.665014
78 2022-10-04 00:22:40 3.659 3.615000 6.313763
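- speed_avg is the 5-minute average computed from the speed column of df1;
- merge_asof matches rows on the time column with an allowed time difference of 15 seconds. Rows of df1 with no df2 time within that tolerance get NaN in speed_df2 and are dropped by the final filter, so only the matched rows remain.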
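If the goal is literally "the closest df1 timestamp for each df2 row", one possible variant is to put df2 on the left and pass direction="nearest" to merge_asof, which looks both backward and forward in time. The sketch below is only an illustration under stated assumptions: the sample data is made up (it is not the data from the question), and the names base, left, right and time_df1 are hypothetical helpers introduced here.

```python
import pandas as pd

# Made-up sample data (an assumption, not the question's data):
# df1 has one row every 15 s, df2 has one row every few minutes.
base = pd.Timestamp("2022-10-04 00:00:00")
df1 = pd.DataFrame({
    "time": [base + pd.Timedelta(seconds=15 * i) for i in range(41)],
    "speed": [float(i) for i in range(41)],
})
df2 = pd.DataFrame({
    "time": [base + pd.Timedelta(seconds=s) for s in (202, 397)],
    "speed": [4.6, 5.3],
})

# Centered 5-minute rolling mean of df1["speed"], indexed by time.
speed_avg = (
    df1.set_index("time")["speed"]
    .rolling("300s", center=True)
    .mean()
    .rename("speed_avg")
    .reset_index()
)

# Keep a copy of the df1 timestamp so the matched row stays visible.
left = df2.rename(columns={"speed": "speed_df2"})
right = speed_avg.assign(time_df1=speed_avg["time"])

# One output row per df2 row, matched to the nearest df1 time within 15 s.
result = pd.merge_asof(
    left,
    right,
    on="time",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=15),
)
print(result)
```

With this layout the result has exactly one row per df2 row, and time_df1 shows which df1 timestamp the average was centered on. Both frames must be sorted by time for merge_asof to work.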