Find the last value before a given timestamp in Pandas
For the table below, I want to add a column to each row recording the ask_size as of some earlier interval, for example 1 millisecond.
For instance, for row 1 the value 1 ms earlier should be 165, because that was the ask_size in effect 1 ms before: even though the preceding timestamp (2024-02-12 09:00:00.178941829) is much older than that, it is still the **valid** size as of 1 ms ago.
As another example, rows 3 through 8 should all get 203, because the size at timestamp 2024-02-12 09:00:00.334723166 is 203, and that is the last timestamp at least 1 ms before each of rows 3 through 8.
I have been looking at merge_asof and tried a few things, but without success. Any help would be appreciated!
Sample table
idx event_timestamp ask_size
0 2024-02-12 09:00:00.178941829 165
1 2024-02-12 09:00:00.334673928 166
2 2024-02-12 09:00:00.334723166 203
3 2024-02-12 09:00:00.339505589 203
4 2024-02-12 09:00:00.339517572 241
5 2024-02-12 09:00:00.339585194 276
6 2024-02-12 09:00:00.339597200 276
7 2024-02-12 09:00:00.339679756 277
8 2024-02-12 09:00:00.339705796 312
9 2024-02-12 09:00:00.343967540 275
10 2024-02-12 09:00:00.393306026 275
Raw data
import pandas as pd

data = {
    'event_timestamp': ['2024-02-12 09:00:00.178941829', '2024-02-12 09:00:00.334673928',
                        '2024-02-12 09:00:00.334723166', '2024-02-12 09:00:00.339505589',
                        '2024-02-12 09:00:00.339517572', '2024-02-12 09:00:00.339585194',
                        '2024-02-12 09:00:00.339597200', '2024-02-12 09:00:00.339679756',
                        '2024-02-12 09:00:00.339705796', '2024-02-12 09:00:00.343967540'],
    'ask_size_1': [165.0, 166.0, 203.0, 203.0, 241.0, 276.0, 276.0, 277.0, 312.0, 275.0]
}
df = pd.DataFrame(data)
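Just to make the expected output concrete, here is a minimal sketch of the intended lookup using numpy.searchsorted instead of merge_asof; the ask_size_1ms_ago column name is mine, and it assumes the df above is already sorted by time:

import numpy as np

df['event_timestamp'] = pd.to_datetime(df['event_timestamp'])
ts = df['event_timestamp'].to_numpy()
cutoff = ts - np.timedelta64(1, 'ms')                 # each row's "1 ms earlier" cutoff

# index of the last row whose timestamp is <= the cutoff (-1 means there is none)
pos = np.searchsorted(ts, cutoff, side='right') - 1
df['ask_size_1ms_ago'] = np.where(pos >= 0, df['ask_size_1'].to_numpy()[pos], np.nan)

# Per the description above: row 1 gets 165, rows 3-8 get 203,
# and row 0 stays NaN because nothing precedes it by at least 1 ms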
Attempt
df['event_timestamp'] = pd.to_datetime(df['event_timestamp'])
df['1ms'] = df['event_timestamp'] - pd.Timedelta(milliseconds=1)
temp = df[['event_timestamp', 'ask_size_1']]
temp_time_shift = df[['1ms', 'ask_size_1']]
# Shifting the right-hand keys back by 1 ms and matching backward ends up
# allowing matches from rows up to 1 ms *after* each left row, not before it
temp2 = pd.merge_asof(
    temp,
    temp_time_shift,
    left_on='event_timestamp',
    right_on='1ms',
    direction='backward'
)
Edit (suggestion):
import pandas as pd
data = {
'event_timestamp': [
'2024-02-12 09:00:00.393306026',
'2024-02-12 09:00:00.393347792',
'2024-02-12 09:00:00.393351971',
'2024-02-12 09:00:00.393355738',
'2024-02-12 09:00:00.393389724',
'2024-02-12 09:00:00.542780521',
'2024-02-12 09:00:00.542841917',
'2024-02-12 09:00:00.714845055',
'2024-02-12 09:00:00.714908862',
'2024-02-12 09:00:00.747016524'
],
'ask_size_1': [275.0, 275.0, 237.0, 237.0, 202.0, 202.0, 202.0, 262.0, 261.0, 263.0]
}
df = pd.DataFrame(data)
df['event_timestamp'] = pd.to_datetime(df['event_timestamp']) # Convert 'event_timestamp' to datetime format
tolerance = pd.Timedelta('1ms')
df['out'] = pd.merge_asof(df['event_timestamp'].sub(tolerance),
                          df[['event_timestamp', 'ask_size_1']],
                          direction='forward', tolerance=tolerance)['ask_size_1']
Below is the output. You can see that for row 7 the ask_size and the output value are the same, but the output should be the last ask_size at least 1 ms before row 7, which is row 6's value, 202.
Viewed that way, the highlighted (yellow) value could technically be NaN, since there is no value at least 1 ms before it.
event_timestamp ask_size_1 out
0 2024-02-12 09:00:00.393306026 275.0 275.0
1 2024-02-12 09:00:00.393347792 275.0 275.0
2 2024-02-12 09:00:00.393351971 237.0 275.0
3 2024-02-12 09:00:00.393355738 237.0 275.0
4 2024-02-12 09:00:00.393389724 202.0 275.0
5 2024-02-12 09:00:00.542780521 202.0 202.0
6 2024-02-12 09:00:00.542841917 202.0 202.0
7 2024-02-12 09:00:00.714845055 262.0 262.0
8 2024-02-12 09:00:00.714908862 261.0 262.0
9 2024-02-12 09:00:00.747016524 263.0 263.0
1 Answer
If I understand correctly, you can indeed use merge_asof for this. You just need to adjust a couple of parameters so that the search runs in the right direction.
delta = pd.Timedelta('1ms')
df['out'] = pd.merge_asof(df['event_timestamp'].sub(delta), df,
                          direction='backward')['ask_size_1']
Note: I'm assuming the timestamps are already sorted. If not, you'll need to sort them before running merge_asof.
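For completeness, a minimal sketch of that sort step, reusing the same df:

# merge_asof needs the key column sorted in both frames
df = df.sort_values('event_timestamp').reset_index(drop=True)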
Output:
event_timestamp ask_size_1 out
0 2024-02-12 09:00:00.393306026 271.0 NaN
1 2024-02-12 09:00:00.393347792 275.0 NaN
2 2024-02-12 09:00:00.393351971 237.0 NaN
3 2024-02-12 09:00:00.393355738 237.0 NaN
4 2024-02-12 09:00:00.393389724 202.0 NaN
5 2024-02-12 09:00:00.542780521 206.0 202.0
6 2024-02-12 09:00:00.542841917 51.0 202.0
7 2024-02-12 09:00:00.714845055 262.0 51.0
8 2024-02-12 09:00:00.714908862 261.0 51.0
9 2024-02-12 09:00:00.747016524 263.0 261.0
If you want 271 for the yellow value instead, you can tweak it slightly:
delta = pd.Timedelta('1ms')
tmp = pd.concat([pd.DataFrame({'event_timestamp': [df['event_timestamp'].iloc[0] - delta],
                               'ask_size_1': [df['ask_size_1'].iloc[0]]}),
                 df])
df['out'] = pd.merge_asof(df['event_timestamp'].sub(delta), tmp,
                          direction='backward',
                          allow_exact_matches=False)['ask_size_1']
Output:
event_timestamp ask_size_1 out
0 2024-02-12 09:00:00.393306026 271.0 NaN
1 2024-02-12 09:00:00.393347792 275.0 271.0
2 2024-02-12 09:00:00.393351971 237.0 271.0
3 2024-02-12 09:00:00.393355738 237.0 271.0
4 2024-02-12 09:00:00.393389724 202.0 271.0
5 2024-02-12 09:00:00.542780521 206.0 202.0
6 2024-02-12 09:00:00.542841917 51.0 202.0
7 2024-02-12 09:00:00.714845055 262.0 51.0
8 2024-02-12 09:00:00.714908862 261.0 51.0
9 2024-02-12 09:00:00.747016524 263.0 261.0
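For what it's worth, applying the same recipe to the frame built from the question's first data block (the one whose size column is ask_size_1 and whose timestamps are still strings) should reproduce the values described there; a sketch, assuming that df is in scope:

df['event_timestamp'] = pd.to_datetime(df['event_timestamp'])   # the question's dict stores strings
delta = pd.Timedelta('1ms')
df['out'] = pd.merge_asof(df['event_timestamp'].sub(delta),
                          df[['event_timestamp', 'ask_size_1']],
                          direction='backward')['ask_size_1']

# Expected, per the question: row 1 -> 165, rows 3-8 -> 203,
# and row 0 -> NaN (nothing precedes it by at least 1 ms)
print(df[['event_timestamp', 'ask_size_1', 'out']])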