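Filling NA values in a pandas Series with a stop condition
I am analyzing a time series and, based on some specific criteria, I can pick out the rows where an event starts or ends. So far my series looks roughly like this (I have omitted some repeated values for brevity):
Setup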
import numpy as np
import pandas
from pandas import Timestamp
datadict = {'event': {
Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
Timestamp('2010-01-01 00:30:00', tz=None): '--',
Timestamp('2010-01-01 00:40:00', tz=None): '--',
Timestamp('2010-01-01 00:50:00', tz=None): '--',
Timestamp('2010-01-01 01:00:00', tz=None): '--',
Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
Timestamp('2010-01-01 01:20:00', tz=None): '--',
Timestamp('2010-01-01 02:20:00', tz=None): '--',
Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
Timestamp('2010-01-01 02:40:00', tz=None): '--',
Timestamp('2010-01-01 02:50:00', tz=None): '--',
Timestamp('2010-01-01 03:00:00', tz=None): '--',
Timestamp('2010-01-01 03:10:00', tz=None): '--',
Timestamp('2010-01-01 03:20:00', tz=None): '--',
Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
}}
data = pandas.DataFrame.from_dict(datadict)
event
2010-01-01 00:20:00 event start
2010-01-01 00:30:00 --
2010-01-01 00:40:00 --
2010-01-01 00:50:00 --
2010-01-01 01:00:00 --
2010-01-01 01:10:00 event end
2010-01-01 01:20:00 --
2010-01-01 02:20:00 --
2010-01-01 02:30:00 event start
2010-01-01 02:40:00 --
2010-01-01 02:50:00 --
2010-01-01 03:00:00 --
2010-01-01 03:10:00 --
2010-01-01 03:20:00 --
2010-01-01 03:30:00 event end
What I want to achieve is the following (preferably without a for loop):
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- NA
2010-01-01 02:20:00 -- NA
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
2010-01-01 03:40:00 -- NA
2010-01-01 03:50:00 -- NA
What I have tried
After making some optimistic assumptions about the quality of my data, I can get the event numbers like this:
table = data[data.event != '--'].reset_index()
table['event number'] = 1 + np.floor(table.index / 2)
table = table.set_index('index')
event event number
index
2010-01-01 00:20:00 event start 1
2010-01-01 01:10:00 event end 1
2010-01-01 02:30:00 event start 2
2010-01-01 03:30:00 event end 2
I can then join this with the original DataFrame and fill the gaps with fillna using method='ffill'.
data2 = data.join(table[['event number']])
data2['filled'] = data2['event number'].fillna(method='ffill')
event event number filled
2010-01-01 00:20:00 event start 1 1
2010-01-01 00:30:00 -- NaN 1
2010-01-01 00:40:00 -- NaN 1
2010-01-01 00:50:00 -- NaN 1
2010-01-01 01:00:00 -- NaN 1
2010-01-01 01:10:00 event end 1 1
2010-01-01 01:20:00 -- NaN 1 # <- d'oh
2010-01-01 02:20:00 -- NaN 1 # <- d'oh
2010-01-01 02:30:00 event start 2 2
2010-01-01 02:40:00 -- NaN 2
2010-01-01 02:50:00 -- NaN 2
2010-01-01 03:00:00 -- NaN 2
2010-01-01 03:10:00 -- NaN 2
2010-01-01 03:20:00 -- NaN 2
2010-01-01 03:30:00 event end 2 2
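Problem
As you can see, the time between the events (from 01:20 to 02:20) is wrongly associated with event #1.
Question
Is there a way to skip these stretches without using a loop?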
1 Answer
You can achieve this by looking at the cumulative sums of the event starts and the event ends:
>>> data['event number'] = (data.event == 'event start').cumsum()
>>> data
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- 1
2010-01-01 02:20:00 -- 1
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
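Now you only need to set the event number to NaN wherever no event is in progress. Those are the rows where the cumulative sum of event starts equals the cumulative sum of event ends (shifted down by one row, so that the 'event end' row itself keeps its number).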
>>> idx = data['event number'] == (data.event.shift(1) == 'event end').cumsum()
>>> data.loc[idx, 'event number'] = np.nan
>>> data
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- NaN
2010-01-01 02:20:00 -- NaN
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
[15 rows x 2 columns]
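The two steps can also be wrapped in one helper that leaves the input untouched. This is only a sketch: the function name `number_events` is hypothetical, it uses `Series.where` instead of assigning through `.loc`, and it assumes the markers strictly alternate between 'event start' and 'event end'.

```python
import pandas as pd

def number_events(events: pd.Series) -> pd.Series:
    """Number each event from its start row to its end row; NaN elsewhere.

    Hypothetical helper; assumes 'event start' and 'event end' alternate.
    """
    # Running count of events that have started so far
    started = (events == 'event start').cumsum()
    # Running count of events that ended on an earlier row; shifting by one
    # row keeps the 'event end' row itself inside its event
    ended = (events == 'event end').shift(1, fill_value=False).cumsum()
    # Keep the counter only while more events have started than ended
    return started.where(started > ended)

# Made-up sample data with two short events
idx = pd.date_range('2010-01-01 00:20', periods=7, freq='10min')
events = pd.Series(
    ['event start', '--', 'event end', '--', 'event start', 'event end', '--'],
    index=idx,
)
numbers = number_events(events)
print(numbers.tolist())  # [1.0, 1.0, 1.0, nan, 2.0, 2.0, nan]
```

With the DataFrame from the question this would be used as `data['event number'] = number_events(data['event'])`. The result is a float column because NaN cannot be stored in an integer column.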