Appending a time series to a DataFrame
I have a DataFrame that looks like this:
"2023-09-07 13:22" type1 12.7
"2023-09-07 14:07" type2 101.1
I also have a separate DataFrame that holds an evenly spaced time series for each type:
type1 type2
2023-09-07 08:00 1 2
2023-09-07 08:15 3 4
2023-09-07 08:30 5 6
...
2023-09-07 13:15 7 8
2023-09-07 13:30 9 10
2023-09-07 13:45 11 12
2023-09-07 14:00 13 14
2023-09-07 14:15 15 16
2023-09-07 14:30 17 18
...
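For each row of the first DataFrame, I want to take 2 (or more) values from the second DataFrame, starting at the first timestamp after the row's own timestamp, and append them to that row as extra columns.
So in this example the result would be: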
"2023-09-07 13:22" type1 12.7 9 11
"2023-09-07 14:07" type2 101.1 16 18
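I could iterate over the rows of the first DataFrame and look up a slice of the second DataFrame each time, but that is slow. Is there a better solution? This seems like a very common task.
Code to generate the input DataFrames: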
import pandas as pd

df1 = pd.DataFrame(columns=["date", "type", "val"])
df1.loc[0] = [pd.to_datetime("2023-09-07 13:22:00"), "type1", 12.1]
df1.loc[1] = [pd.to_datetime("2023-09-07 14:07:00"), "type2", 101.1]
df1 = df1.set_index("date")

df2 = pd.DataFrame()
df2["date"] = pd.to_datetime([
    "2023-09-07 08:00", "2023-09-07 08:15", "2023-09-07 08:30",
    "2023-09-07 13:15", "2023-09-07 13:30", "2023-09-07 13:45",
    "2023-09-07 14:00", "2023-09-07 14:15", "2023-09-07 14:30",
])
df2["type1"] = [1, 3, 5, 7, 9, 11, 13, 15, 17]
df2["type2"] = [2, 4, 6, 8, 10, 12, 14, 16, 18]
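For reference, the row-by-row baseline described above might look like the following sketch. It deliberately uses only the standard library instead of pandas, and the names `times`, `series`, `rows`, and `next_values` are made up for illustration. It is only meant to pin down the expected result (first timestamp at or after the row's timestamp, then the next N values), not to be fast.

```python
from bisect import bisect_left
from datetime import datetime


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM' string into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


# Sorted timestamps of the second table (hypothetical stand-in for df2's dates).
times = [parse(t) for t in [
    "2023-09-07 08:00", "2023-09-07 08:15", "2023-09-07 08:30",
    "2023-09-07 13:15", "2023-09-07 13:30", "2023-09-07 13:45",
    "2023-09-07 14:00", "2023-09-07 14:15", "2023-09-07 14:30",
]]
series = {
    "type1": [1, 3, 5, 7, 9, 11, 13, 15, 17],
    "type2": [2, 4, 6, 8, 10, 12, 14, 16, 18],
}
# Rows of the first table: (timestamp, type, value).
rows = [
    ("2023-09-07 13:22", "type1", 12.7),
    ("2023-09-07 14:07", "type2", 101.1),
]
N = 2


def next_values(ts, kind, n):
    """Return up to n values of `kind`, starting at the first time >= ts."""
    start = bisect_left(times, parse(ts))
    return series[kind][start:start + n]


result = [row + tuple(next_values(row[0], row[1], N)) for row in rows]
for row in result:
    print(row)
```

Any faster approach should reproduce the same trailing values for these two rows.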
2 Answers
3
For a fully vectorized approach, you can use numpy's sliding_window_view to build a rolling DataFrame, then combine it with df1 using merge_asof:
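import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view as swv

N = 2

# ensure date is the index
df2 = (df2.set_index('date')
          .rename_axis(columns='type')
       )

# create the sliding window
# reshape to "melted" format: one row per (date, type) pair
# only the first len(df2)-N+1 dates have a full window
# (written this way so that N = 1 also works, unlike iloc[:-N+1])
tmp = (pd.DataFrame(swv(df2, N, axis=0)
                    .reshape(-1, N),
                    index=df2.iloc[:len(df2)-N+1].stack(dropna=False).index)
         .reset_index('type')
       )

# merge_asof: for each df1 row, take the first window of the same type
# whose date is at or after the row's date
out = pd.merge_asof(df1, tmp,
                    left_index=True, right_index=True,
                    by='type', direction='forward')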
Output:
type val 0 1
date
2023-09-07 13:22:00 type1 12.1 9 11
2023-09-07 14:07:00 type2 101.1 16 18
Intermediate result tmp:
type 0 1
date
2023-09-07 08:00:00 type1 1 3
2023-09-07 08:00:00 type2 2 4
2023-09-07 08:15:00 type1 3 5
2023-09-07 08:15:00 type2 4 6
2023-09-07 08:30:00 type1 5 7
2023-09-07 08:30:00 type2 6 8
2023-09-07 13:15:00 type1 7 9
2023-09-07 13:15:00 type2 8 10
2023-09-07 13:30:00 type1 9 11
2023-09-07 13:30:00 type2 10 12
2023-09-07 13:45:00 type1 11 13
2023-09-07 13:45:00 type2 12 14
2023-09-07 14:00:00 type1 13 15
2023-09-07 14:00:00 type2 14 16
2023-09-07 14:15:00 type1 15 17
2023-09-07 14:15:00 type2 16 18
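To see why the reshaped windows line up with the stacked (date, type) index, here is a small numpy-only sketch. The array `values` is a made-up stand-in for the numeric content of df2 (rows are time steps, columns are types). sliding_window_view appends the window dimension last, so flattening to N columns yields one row per (time, type) pair, in the same order that stack produces.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy stand-in for df2's values: 4 time steps (rows) x 2 types (columns).
values = np.array([[1, 2],
                   [3, 4],
                   [5, 6],
                   [7, 8]])
N = 2

# Windows along the time axis; the window dimension is appended last,
# so the shape is (n_times - N + 1, n_types, N).
windows = sliding_window_view(values, N, axis=0)
print(windows.shape)

# Flatten to one row per (time, type) pair, each holding N consecutive values.
flat = windows.reshape(-1, N)
print(flat)
```

The first rows of `flat` correspond to the first rows of tmp shown above: type1 at 08:00, type2 at 08:00, type1 at 08:15, and so on.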
Note: if needed (for example, to also get incomplete slices), you can pad the DataFrame before calling swv:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view as swv

N = 3

# ensure date is the index
# (start from the original df2, where date is still a column)
df2 = (df2.set_index('date')
          .rename_axis(columns='type')
       )

# pad N-1 rows of NaN at the end so that every date gets a window
# (cast to float first, since integers cannot hold NaN)
# create the sliding window
# reshape to "melted" format
tmp = (pd.DataFrame(swv(np.pad(df2.astype(float), ((0, N-1), (0, 0)),
                               constant_values=np.nan),
                        N, axis=0)
                    .reshape(-1, N),
                    index=df2.stack(dropna=False).index)
         .reset_index('type')
       )

# merge_asof
out = pd.merge_asof(df1, tmp,
                    left_index=True, right_index=True,
                    by='type', direction='forward')
Output:
type val 0 1 2
date
2023-09-07 13:22:00 type1 12.1 9.0 11.0 13.0
2023-09-07 14:07:00 type2 101.1 16.0 18.0 NaN
Without padding, the second row has no complete window to match, so the output would be:
type val 0 1 2
date
2023-09-07 13:22:00 type1 12.1 9.0 11.0 13.0
2023-09-07 14:07:00 type2 101.1 NaN NaN NaN
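The effect of the padding can be checked in isolation with a numpy-only sketch. The array `values` below is a made-up stand-in for the last three rows of df2 (14:00, 14:15, 14:30). After appending N-1 rows of NaN, every original time step gets a window, and the windows near the end are partly NaN.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy stand-in for the last three rows of df2 (columns: type1, type2).
values = np.array([[13, 14],
                   [15, 16],
                   [17, 18]], dtype=float)
N = 3

# Append N-1 rows of NaN so that the number of windows equals the number of rows.
padded = np.pad(values, ((0, N - 1), (0, 0)), constant_values=np.nan)

# One window per original row; shape is (n_times, n_types, N).
windows = sliding_window_view(padded, N, axis=0)
flat = windows.reshape(-1, N)
print(flat)
```

Row 3 of `flat` is the type2 window starting at 14:15, which holds 16, 18 and one NaN, matching the padded output above.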
1
You can try pd.merge_asof to find the starting timestamp, followed by slicing as a second step.
Input DataFrames (sorted by index):
df1
type value
time
2023-09-07 13:22:00 type1 12.7
2023-09-07 14:07:00 type2 101.1
df2
type1 type2
time
2023-09-07 08:00:00 1 2
2023-09-07 08:15:00 3 4
2023-09-07 08:30:00 5 6
2023-09-07 13:15:00 7 8
2023-09-07 13:30:00 9 10
2023-09-07 13:45:00 11 12
2023-09-07 14:00:00 13 14
2023-09-07 14:15:00 15 16
2023-09-07 14:30:00 17 18
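import pandas as pd

N = 2

# keep the df2 timestamp as a column so it survives the merge
df2["time_tmp"] = df2.index
tmp = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction="forward")

# for each row, slice df2 from the matched timestamp onward
# and keep the first N values of that row's type column
df1[list(range(N))] = tmp.apply(
    lambda x: df2.loc[x["time_tmp"] :, x["type"]][:N].values,
    axis=1,
    result_type="expand",
)
print(df1)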
Output:
type value 0 1
time
2023-09-07 13:22:00 type1 12.7 9 11
2023-09-07 14:07:00 type2 101.1 16 18
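To see where each slice starts, it can help to look at the time_tmp column after the merge. The sketch below rebuilds small versions of the two frames under the names `left` and `right` (hypothetical names, with only the afternoon rows of the time series) and prints the timestamp that direction="forward" picks for each row.

```python
import pandas as pd

# Small stand-in for df1, indexed by time.
left = pd.DataFrame(
    {"type": ["type1", "type2"], "value": [12.7, 101.1]},
    index=pd.to_datetime(["2023-09-07 13:22", "2023-09-07 14:07"]),
)

# Small stand-in for df2 (afternoon rows only), indexed by time.
right = pd.DataFrame(
    {"type1": [9, 11, 13, 15, 17], "type2": [10, 12, 14, 16, 18]},
    index=pd.to_datetime([
        "2023-09-07 13:30", "2023-09-07 13:45", "2023-09-07 14:00",
        "2023-09-07 14:15", "2023-09-07 14:30",
    ]),
)
right["time_tmp"] = right.index

# direction="forward" matches each left row with the first right row
# whose timestamp is at or after the left timestamp.
tmp = pd.merge_asof(left, right, left_index=True, right_index=True, direction="forward")
matched = tmp["time_tmp"].tolist()
print(matched)
```

The slices in the answer then start at these matched timestamps (13:30 for the first row, 14:15 for the second).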
Addendum: a "safer" version that uses np.pad (for rows where fewer than N values are available):
import numpy as np
import pandas as pd

def fn(row):
    # first N values of the row's type column, from the matched timestamp onward
    vals = df2.loc[row["time_tmp"] :, row["type"]][:N].values
    if len(vals) < N:
        # fill the missing trailing positions with -1
        vals = np.pad(
            vals, mode="constant", pad_width=(0, N - len(vals)), constant_values=-1
        )
    return vals

df2["time_tmp"] = df2.index
tmp = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction="forward")
df1[list(range(N))] = tmp.apply(
    fn,
    axis=1,
    result_type="expand",
)
print(df1)