重采样包含经过时间值的pandas时间序列
我有一些时间序列数据,格式在这篇文章底部有展示。
我想把这些数据重新采样成30分钟的间隔,但我需要把“状态时间”的值按正确的间隔进行拆分(这些值是以整秒表示的)。
现在假设某一行的“状态时间”是2342秒(超过30分钟),而开始时间是08:22:00。
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 08:22:00 A 2342
当重新采样完成后,我需要把“状态时间”按它溢出到的时间段进行拆分,像这样:
User Start Date Time Period State Time in State (secs)
J.Doe 03-02-2014 08:00:00 A 480
J.Doe 03-02-2014 08:30:00 A 1800
J.Doe 03-02-2014 09:00:00 A 62
480+1800+62 = 2342
我完全不知道该怎么在pandas中实现这个...希望能得到一些帮助:-)
源数据格式:
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 07:58:00 A 36
J.Doe 03-02-2014 07:59:00 A 43
J.Doe 03-02-2014 08:00:00 A 59
J.Doe 03-02-2014 08:01:00 A 32
J.Doe 03-02-2014 08:21:00 A 15
J.Doe 03-02-2014 08:22:00 B 3
J.Doe 03-02-2014 08:22:00 A 2342
J.Doe 03-02-2014 09:01:00 B 1
J.Doe 03-02-2014 09:01:00 A 375
J.Doe 03-02-2014 09:07:00 B 3
J.Doe 03-02-2014 09:07:00 A 6408
J.Doe 03-02-2014 10:54:00 B 2
J.Doe 03-02-2014 10:54:00 A 116
J.Doe 03-02-2014 10:58:00 B 2
J.Doe 03-02-2014 10:58:00 A 122
J.Doe 03-02-2014 10:58:00 A 12
J.Doe 03-02-2014 11:00:00 B 2
J.Doe 03-02-2014 11:00:00 A 3417
J.Doe 03-02-2014 11:57:00 B 3
J.Doe 03-02-2014 11:57:00 A 120
J.Doe 03-02-2014 11:59:00 C 165
J.Doe 03-02-2014 12:02:00 B 3
J.Doe 03-02-2014 12:02:00 A 7254
1 个回答
1
我首先会创建“开始”和“结束”这两列(作为日期时间对象):
In [11]: df['Start'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
In [12]: df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second)
In [13]: row = df.iloc[6, :]
In [14]: row
Out[14]:
User J.Doe
Start Date 03-02-2014
Start Time 08:22:00
State A
Time in State (secs) 2342
Start 2014-03-02 08:22:00
End 2014-03-02 09:01:02
Name: 6, dtype: object
获取分割时间的一种方法是从“开始”和“结束”进行重采样,合并后使用差分计算:
def split_times(row):
y = pd.Series(0, [row['Start'], row['End']])
splits = y.resample('30min').index + y.index # this fills in middle and sorts too
res = -splits.to_series().diff(-1)
if len(res) > 2: res = res[1:-1]
elif len(res) == 2: res = res[1:]
return res.astype(int).resample('30min').astype(np.timedelta64) # hack to resample again
In [16]: split_times(row)
Out[16]:
2014-03-02 08:22:00 00:08:00
2014-03-02 08:30:00 00:30:00
2014-03-02 09:00:00 00:01:02
dtype: timedelta64[ns]
In [17]: df.apply(split_times, 1)
Out[17]:
2014-03-02 07:30:00 2014-03-02 08:00:00 2014-03-02 08:30:00 2014-03-02 09:00:00 2014-03-02 09:30:00 2014-03-02 10:00:00 2014-03-02 10:30:00 2014-03-02 11:00:00 2014-03-02 11:30:00 2014-03-02 12:00:00 2014-03-02 12:30:00 2014-03-02 13:00:00 2014-03-02 13:30:00 2014-03-02 14:00:00
0 00:00:36 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 00:00:43 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT 00:00:32 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT 00:00:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT 00:08:00 00:30:00 00:01:02 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT 00:00:01 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT 00:06:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
10 NaT NaT NaT 00:23:00 00:30:00 00:30:00 00:23:48 NaT NaT NaT NaT NaT NaT NaT
11 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
12 NaT NaT NaT NaT NaT NaT 00:01:56 NaT NaT NaT NaT NaT NaT NaT
13 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
14 NaT NaT NaT NaT NaT NaT 00:02:00 00:00:02 NaT NaT NaT NaT NaT NaT
15 NaT NaT NaT NaT NaT NaT 00:00:12 NaT NaT NaT NaT NaT NaT NaT
16 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
17 NaT NaT NaT NaT NaT NaT NaT NaT 00:26:57 NaT NaT NaT NaT NaT
18 NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT
19 NaT NaT NaT NaT NaT NaT NaT NaT 00:02:00 NaT NaT NaT NaT NaT
20 NaT NaT NaT NaT NaT NaT NaT NaT 00:01:00 00:01:45 NaT NaT NaT NaT
21 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT
22 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:28:00 00:30:00 00:30:00 00:30:00 00:02:54
要把NaT(缺失的时间)替换成0,看起来在0.13.1版本中你需要做一些调整(这个问题可能在主版本中已经修复,否则就是个bug):
res2 = df.apply(split_times, 1).astype(int)
# hack to replace NaTs with 0
res2.where(res2 != -9223372036854775808, 0).astype(np.timedelta64)
# to just get the seconds
seconds = res2.where(res2 != -9223372036854775808, 0) / 10 ** 9