基于另一个数据框的时间对pandas数据框进行滚动平均计算

1 投票
1 回答
44 浏览
提问于 2025-04-14 15:45

我有一个数据框(df1),大约每15秒就有一条数据,还有另一个数据框(df2),大约每5分钟有一条数据。下面是这两个数据框的内容。我想把df2中的“时间”变量与df1中最接近的时间戳匹配,然后以这个最接近的时间戳为中心,计算5分钟内的df1数据平均值。

df1:

                  time  speed
0  2022-10-04 00:00:24  4.590
1  2022-10-04 00:00:41  4.389
2  2022-10-04 00:00:57  4.367
3  2022-10-04 00:01:14  4.539
4  2022-10-04 00:01:30  4.584
5  2022-10-04 00:01:48  4.523
6  2022-10-04 00:02:05  4.498
7  2022-10-04 00:02:21  4.625
8  2022-10-04 00:02:38  4.497
9  2022-10-04 00:02:54  4.406
10 2022-10-04 00:03:12  4.502
11 2022-10-04 00:03:28  4.494
12 2022-10-04 00:03:45  4.445
13 2022-10-04 00:04:01  4.438
14 2022-10-04 00:04:18  4.433
15 2022-10-04 00:04:36  4.441
16 2022-10-04 00:04:52  4.400
17 2022-10-04 00:05:09  4.221
18 2022-10-04 00:05:27  4.115
19 2022-10-04 00:05:43  4.009
20 2022-10-04 00:06:01  4.230
21 2022-10-04 00:06:18  4.360
22 2022-10-04 00:06:34  4.331
23 2022-10-04 00:06:51  4.178
24 2022-10-04 00:07:07  4.238
25 2022-10-04 00:07:25  4.125
26 2022-10-04 00:07:43  3.988
27 2022-10-04 00:08:17  3.573
28 2022-10-04 00:08:34  4.471
29 2022-10-04 00:08:50  4.567
30 2022-10-04 00:09:08  4.451
31 2022-10-04 00:09:25  4.311
32 2022-10-04 00:09:42  4.280
33 2022-10-04 00:09:59  4.439
34 2022-10-04 00:10:17  4.410
35 2022-10-04 00:10:35  4.335
36 2022-10-04 00:10:51  4.193
37 2022-10-04 00:11:08  4.140
38 2022-10-04 00:11:25  4.020
39 2022-10-04 00:11:43  3.872
40 2022-10-04 00:12:01  3.859
41 2022-10-04 00:12:17  4.062
42 2022-10-04 00:12:34  3.861
43 2022-10-04 00:12:51  3.780
44 2022-10-04 00:13:07  3.680
45 2022-10-04 00:13:25  3.909
46 2022-10-04 00:13:42  3.852
47 2022-10-04 00:13:58  3.867
48 2022-10-04 00:14:15  3.715
49 2022-10-04 00:14:32  3.534
50 2022-10-04 00:14:49  3.349
51 2022-10-04 00:15:06  3.213
52 2022-10-04 00:15:23  3.215
53 2022-10-04 00:15:39  3.246
54 2022-10-04 00:15:55  3.195
55 2022-10-04 00:16:14  3.164
56 2022-10-04 00:16:30  3.149
57 2022-10-04 00:16:47  3.281
58 2022-10-04 00:17:03  3.366
59 2022-10-04 00:17:20  3.295
60 2022-10-04 00:17:38  3.487
61 2022-10-04 00:17:54  3.534
62 2022-10-04 00:18:11  3.430
63 2022-10-04 00:18:27  3.474
64 2022-10-04 00:18:44  3.275
65 2022-10-04 00:19:01  3.584
66 2022-10-04 00:19:18  3.616
67 2022-10-04 00:19:34  3.506
68 2022-10-04 00:19:51  3.561
69 2022-10-04 00:20:08  3.316
70 2022-10-04 00:20:27  3.396
71 2022-10-04 00:20:43  3.536
72 2022-10-04 00:20:59  3.631
73 2022-10-04 00:21:16  3.573
74 2022-10-04 00:21:33  3.514
75 2022-10-04 00:21:50  3.603
76 2022-10-04 00:22:07  3.591
77 2022-10-04 00:22:23  3.591
78 2022-10-04 00:22:40  3.659
79 2022-10-04 00:23:14  4.056

df2:

                 time     speed
0 2022-10-03 00:03:23  4.646689
1 2022-10-03 00:08:24  5.328516
2 2022-10-03 00:13:24  5.895778
3 2022-10-03 00:18:24  5.665014
4 2022-10-03 00:22:25  6.313763

我知道可以使用pandas.merge_asof这个方法来尽量对齐时间,并且可以设置一个'tolerance'参数来控制允许的误差范围。接下来,我尝试了几种组合使用pandas.groupby和pandas.rolling,但还是没能得到想要的结果。我不知道该怎么继续,所以任何帮助都非常感谢。

1 个回答

1

如果我理解得没错,你可以试试下面这样的做法:

df1_rolling_mean = (
    df1.rolling(window="300s", on="time", center=True)
    .mean()
    .rename(columns={"speed": "speed_avg"})
)
df1 = pd.merge(df1, df1_rolling_mean)

df = pd.merge_asof(
    df1,
    df2.rename(columns={"speed": "speed_df2"}),
    on="time",
    tolerance=pd.Timedelta(seconds=15),
)

df = df[df.loc[:, "speed_df2"].notna()]

这样做的结果会是:

                  time  speed  speed_avg  speed_df2
11 2022-10-04 00:03:28  4.494   4.421765   4.646689
28 2022-10-04 00:08:34  4.471   4.265625   5.328516
45 2022-10-04 00:13:25  3.909   3.687167   5.895778
63 2022-10-04 00:18:27  3.474   3.410000   5.665014
78 2022-10-04 00:22:40  3.659   3.615000   6.313763
  • speed_avg 是从 df1speed 列计算出的5分钟平均值;
  • 在用 merge_asof 按照 time 列合并时,我们只保留 df1df2time 列有共同值的那些列,并且允许的时间差是15秒。

撰写回答