Appending a time series to a DataFrame

1 vote
2 answers
63 views
Asked 2025-04-14 17:50

I have a DataFrame that looks like this:

"2023-09-07 13:22" type1 12.7
"2023-09-07 14:07" type2 101.1 

And a separate DataFrame holding an evenly spaced time series for each type:

                   type1     type2
2023-09-07 08:00       1         2
2023-09-07 08:15       3         4
2023-09-07 08:30       5         6
...
2023-09-07 13:15       7         8
2023-09-07 13:30       9        10      
2023-09-07 13:45      11        12
2023-09-07 14:00      13        14
2023-09-07 14:15      15        16
2023-09-07 14:30      17        18
...

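For each row of the first DataFrame, I want to append 2 (or more) values from the second DataFrame as extra columns, taken from the column matching the row's type and starting at the first timestamp after the row's own timestamp.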

So for this example the result would be:

"2023-09-07 13:22" type1 12.7    9 11
"2023-09-07 14:07" type2 101.1  16 18

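I could iterate over the rows of the first DataFrame and look up a slice of the second one each time, but that is slow. Is there a better approach? This seems like a very common task.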
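For reference, the row-by-row approach described above might look roughly like the sketch below. The names `append_following_values`, `events`, and `series` are made up for illustration, and the sample data is a shortened version of the tables shown earlier; this is only an assumed baseline, not code from the original post.

```python
import pandas as pd

def append_following_values(events, series, n):
    """Row-by-row baseline: for each event, take the next n values of its series."""
    rows = []
    for ts, kind in zip(events.index, events["type"]):
        # Position of the first series timestamp that is >= the event timestamp
        start = series.index.searchsorted(ts, side="left")
        rows.append(series[kind].iloc[start:start + n].tolist())
    extra = pd.DataFrame(rows, index=events.index)
    return events.join(extra)

events = pd.DataFrame(
    {"type": ["type1", "type2"], "val": [12.7, 101.1]},
    index=pd.to_datetime(["2023-09-07 13:22", "2023-09-07 14:07"]),
)
series = pd.DataFrame(
    {"type1": [7, 9, 11, 13, 15, 17], "type2": [8, 10, 12, 14, 16, 18]},
    index=pd.to_datetime([
        "2023-09-07 13:15", "2023-09-07 13:30", "2023-09-07 13:45",
        "2023-09-07 14:00", "2023-09-07 14:15", "2023-09-07 14:30",
    ]),
)
result = append_following_values(events, series, n=2)
print(result)
```

The Python-level loop runs once per row of the first DataFrame, which is the part that becomes slow on large inputs.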

Code to generate the input DataFrames:

import pandas as pd

# df1: one row per event, indexed by timestamp
df1 = pd.DataFrame(columns = ["date", "type", "val"])
df1.loc[0] = [pd.to_datetime("2023-09-07 13:22:00"), "type1", 12.1]
df1.loc[1] = [pd.to_datetime("2023-09-07 14:07:00"), "type2", 101.1]
df1 = df1.set_index("date")

# df2: evenly spaced time series, one column per type
df2 = pd.DataFrame()
df2["date"] = pd.to_datetime(["2023-09-07 08:00", "2023-09-07 08:15","2023-09-07 08:30", "2023-09-07 13:15","2023-09-07 13:30", "2023-09-07 13:45","2023-09-07 14:00", "2023-09-07 14:15","2023-09-07 14:30"])
df2["type1"] = [1,3,5,7,9,11,13,15,17]
df2["type2"] = [2,4,6,8,10,12,14,16,18]

2 Answers

3

For a fully vectorized approach, you can use numpy's sliding_window_view to build a rolling DataFrame, then combine the data with merge_asof:

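import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view as swv

N = 2

# ensure date is the index
df2 = (df2.set_index('date')
          .rename_axis(columns='type')
       )

# create the sliding window: shape (rows-N+1, n_types, N)
# reshape to "melted" format: one row per (date, type) pair
# len(df2)-N+1 instead of -N+1 so that N=1 does not yield an empty slice
# (newer pandas versions offer stack(future_stack=True), which keeps
#  NaN rows and does not accept dropna)
tmp = (pd.DataFrame(swv(df2, N, axis=0)
                    .reshape(-1, N),
                    index=df2.iloc[:len(df2)-N+1].stack(dropna=False).index)
         .reset_index('type')
       )

# merge_asof: for each df1 row, first tmp row of the same type at or after its timestamp
out = pd.merge_asof(df1, tmp,
                    left_index=True, right_index=True,
                    by='type', direction='forward')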

Output:

                      type    val   0   1
date                                     
2023-09-07 13:22:00  type1   12.1   9  11
2023-09-07 14:07:00  type2  101.1  16  18

Intermediate result tmp:

                      type   0   1
date                              
2023-09-07 08:00:00  type1   1   3
2023-09-07 08:00:00  type2   2   4
2023-09-07 08:15:00  type1   3   5
2023-09-07 08:15:00  type2   4   6
2023-09-07 08:30:00  type1   5   7
2023-09-07 08:30:00  type2   6   8
2023-09-07 13:15:00  type1   7   9
2023-09-07 13:15:00  type2   8  10
2023-09-07 13:30:00  type1   9  11
2023-09-07 13:30:00  type2  10  12
2023-09-07 13:45:00  type1  11  13
2023-09-07 13:45:00  type2  12  14
2023-09-07 14:00:00  type1  13  15
2023-09-07 14:00:00  type2  14  16
2023-09-07 14:15:00  type1  15  17
2023-09-07 14:15:00  type2  16  18
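To see how the window array lines up with the stacked (date, type) index, here is a minimal sketch on a tiny array. The array `values` is invented for illustration and stands in for the first four rows of df2.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# 4 timestamps x 2 types (hypothetical stand-in for the first rows of df2)
values = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
N = 2

# Axes of the result: (window start row, type, offset within the window)
windows = sliding_window_view(values, N, axis=0)
print(windows.shape)

# Flatten to one row per (start row, type) pair; the order is row-major,
# i.e. type1 then type2 for the first start row, then the next start row, ...
flat = windows.reshape(-1, N)
print(flat)
```

Because the flattened rows follow the same row-then-column order that stacking the DataFrame produces, the stacked index can be attached to them directly.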

Note: if needed (for example, to return incomplete slices), you can pad the DataFrame before applying swv:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view as swv

N = 3

# ensure date is the index (start again from the original df2)
df2 = (df2.set_index('date')
          .rename_axis(columns='type')
       )

# pad N-1 rows of NaN at the end so every date has a full-length window
# create the sliding window
# reshape to "melted" format
tmp = (pd.DataFrame(swv(np.pad(df2.astype(float), ((0, N-1), (0, 0)),
                               constant_values=np.nan),
                        N, axis=0)
                    .reshape(-1, N),
                    index=df2.stack(dropna=False).index)
         .reset_index('type')
       )

# merge_asof
out = pd.merge_asof(df1, tmp,
                    left_index=True, right_index=True,
                    by='type', direction='forward')

Output:

                      type    val     0     1     2
date                                               
2023-09-07 13:22:00  type1   12.1   9.0  11.0  13.0
2023-09-07 14:07:00  type2  101.1  16.0  18.0   NaN

Without padding, the second row has no complete window to match, so the output would be:

                      type    val    0     1     2
date                                              
2023-09-07 13:22:00  type1   12.1  9.0  11.0  13.0
2023-09-07 14:07:00  type2  101.1  NaN   NaN   NaN

1

You can try pd.merge_asof, followed by slicing as a second step:

Input DataFrames (sorted by index):

df1

                      type  value
time                             
2023-09-07 13:22:00  type1   12.7
2023-09-07 14:07:00  type2  101.1

df2

                     type1  type2
time                             
2023-09-07 08:00:00      1      2
2023-09-07 08:15:00      3      4
2023-09-07 08:30:00      5      6
2023-09-07 13:15:00      7      8
2023-09-07 13:30:00      9     10
2023-09-07 13:45:00     11     12
2023-09-07 14:00:00     13     14
2023-09-07 14:15:00     15     16
2023-09-07 14:30:00     17     18

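N = 2

# keep each df2 timestamp as a column so the matched time survives the merge
df2["time_tmp"] = df2.index
# for each df1 row, find the first df2 row at or after its timestamp
tmp = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction="forward")
# slice N values from the matched time onward, in the column named by the row's type
df1[list(range(N))] = tmp.apply(
    lambda x: df2.loc[x["time_tmp"] :, x["type"]][:N].values,
    axis=1,
    result_type="expand",
)
print(df1)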

Output:

                      type  value   0   1
time                                     
2023-09-07 13:22:00  type1   12.7   9  11
2023-09-07 14:07:00  type2  101.1  16  18
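The key step is the forward merge_asof, which records for each row the first series timestamp at or after it. A minimal sketch of just that step, using shortened hypothetical frames named `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame(
    {"type": ["type1", "type2"]},
    index=pd.to_datetime(["2023-09-07 13:22", "2023-09-07 14:07"]),
)
right = pd.DataFrame(
    {"type1": [9, 11, 13, 15], "type2": [10, 12, 14, 16]},
    index=pd.to_datetime([
        "2023-09-07 13:30", "2023-09-07 13:45",
        "2023-09-07 14:00", "2023-09-07 14:15",
    ]),
)
# Keep the matched timestamp as a column; the merged result keeps left's index
right["time_tmp"] = right.index

matched = pd.merge_asof(left, right, left_index=True, right_index=True,
                        direction="forward")
print(matched[["type", "time_tmp"]])
```

Here 13:22 is matched to 13:30 and 14:07 to 14:15; the per-row slice then starts from `time_tmp`.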

Addendum: a "safer" version using np.pad (for rows that have fewer than N values available):

import numpy as np

def fn(row):
    # values from the matched time onward, in the column named by the row's type
    vals = df2.loc[row["time_tmp"] :, row["type"]][:N].values
    if len(vals) < N:
        # fewer than N values left: pad the tail with -1 as a sentinel
        vals = np.pad(
            vals, mode="constant", pad_width=(0, N - len(vals)), constant_values=-1
        )
    return vals


df2["time_tmp"] = df2.index
tmp = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction="forward")
df1[list(range(N))] = tmp.apply(
    fn,
    axis=1,
    result_type="expand",
)
print(df1)
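If the per-row apply becomes a bottleneck, one possible alternative (not part of the answer above) is to compute all start positions at once with searchsorted and pull the values out with NumPy integer indexing. The helper name `next_values` and the `fill` parameter are hypothetical, and the sketch assumes the series index is sorted and every value in the type column exists as a series column.

```python
import numpy as np
import pandas as pd

def next_values(events, series, n, fill=np.nan):
    """For each event row, return the next n values of its series, padded with fill."""
    values = series.to_numpy(dtype=float)
    # Append n rows of fill so every window of length n stays in bounds
    padded = np.vstack([values, np.full((n, values.shape[1]), fill)])
    # First series row whose timestamp is >= each event timestamp
    start = series.index.searchsorted(events.index, side="left")
    # Column position of each event's type; -1 means the type is unknown
    col = series.columns.get_indexer(events["type"])
    if (col < 0).any():
        raise KeyError("some event types are not columns of the series")
    rows = start[:, None] + np.arange(n)  # shape (n_events, n)
    return pd.DataFrame(padded[rows, col[:, None]], index=events.index)

events = pd.DataFrame(
    {"type": ["type1", "type2"], "value": [12.7, 101.1]},
    index=pd.to_datetime(["2023-09-07 13:22", "2023-09-07 14:07"]),
)
series = pd.DataFrame(
    {"type1": [1, 3, 5, 7, 9, 11, 13, 15, 17],
     "type2": [2, 4, 6, 8, 10, 12, 14, 16, 18]},
    index=pd.to_datetime([
        "2023-09-07 08:00", "2023-09-07 08:15", "2023-09-07 08:30",
        "2023-09-07 13:15", "2023-09-07 13:30", "2023-09-07 13:45",
        "2023-09-07 14:00", "2023-09-07 14:15", "2023-09-07 14:30",
    ]),
)
out = next_values(events, series, n=3)
print(events.join(out))
```

Padding with NaN forces a float result; passing `fill=-1` would mirror the sentinel used in the np.pad version above.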
