Merging multiple rows of a single DataFrame into one row

Posted 2024-05-13 20:56:05


I have a time-series DataFrame of train traffic data:

import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])

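At a station, a train either passes through or couples or uncouples some coaches. Because this is time-series data, every event sits on its own row.

I need to merge the rows of the same train at the same station when two movements (mvt) happen one after the other (second timestamp > first timestamp), put the two movements in separate columns (mvt_x and mvt_y), and keep the timestamp of the last operation. When a train has only a single row at a station, mvt_y is always NaN.

Here is the expected result: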

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00     10    NaN
1      1     1001 2020-05-25 13:50:00     -1    2.0
2      2     1000 2020-05-25 13:35:00     20    NaN
3      1     1002 2020-05-25 14:10:00      0    NaN
4      2     1003 2020-05-25 14:00:00      0    NaN

2 answers

Someone beat me to it... but here is code that also handles a train visiting the same station more than once.

# change df.time to the last time at each stop
# sort by time to account for multiple visits to a station
df = df.sort_values(['train', 'time', 'station'])
# a new stop id starts whenever the train or the station changes
changed = (df.train != df.train.shift()) | (df.station != df.station.shift())
stopid = changed.cumsum()
df.time = df.groupby(['train', 'station', stopid]).time.transform('last')

# create index for mvt on train_station groups
df = df.assign(mvt_id=df.groupby(['train', 'station', 'time']).cumcount())

# reshape df, similar to pivot
df = (
    df.set_index(['train', 'station', 'time', 'mvt_id'])
    .unstack('mvt_id').droplevel(0, axis=1)
    )
df.columns = ['mvt_x', 'mvt_y'] # hardcoded for only 2 movements per station
# might need a generator if expecting more than 2 mvts

df = df.reset_index()

print(df)

Output

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
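The comment about needing something more general than two hardcoded column names can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_stops`, the helper column `stop`, the generated column names `mvt_1`, `mvt_2`, ... (instead of `mvt_x`, `mvt_y`) and the `demo` data are all hypothetical and not part of the question. A stop is taken to be a run of consecutive rows with the same train and station, so a later return to the same station counts as a new stop.

```python
import pandas as pd


def merge_stops(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse consecutive rows of one train at one station into a single row."""
    df = df.sort_values(['train', 'time']).reset_index(drop=True)

    # A new stop starts whenever the train or the station differs from the previous row.
    new_stop = (df['train'] != df['train'].shift()) | (df['station'] != df['station'].shift())
    df['stop'] = new_stop.cumsum()

    # Position of each movement within its stop: 0, 1, 2, ...
    df['mvt_id'] = df.groupby('stop').cumcount()

    # Keep the timestamp of the last movement of each stop.
    df['time'] = df.groupby('stop')['time'].transform('last')

    wide = (df.set_index(['train', 'station', 'time', 'stop', 'mvt_id'])['mvt']
              .unstack('mvt_id'))

    # Generate one column name per movement instead of hardcoding two names.
    wide.columns = [f'mvt_{i + 1}' for i in wide.columns]

    # Restore chronological order per train, then drop the helper column.
    return (wide.reset_index()
                .sort_values('stop')
                .drop(columns='stop')
                .reset_index(drop=True))


# Hypothetical data: train 1 returns to station 1000 after visiting 1001.
demo = pd.DataFrame({
    'train': [1, 1, 1, 1],
    'station': [1000, 1001, 1001, 1000],
    'time': pd.to_datetime(['2020-05-25 13:30:00', '2020-05-25 13:45:00',
                            '2020-05-25 13:50:00', '2020-05-25 14:20:00']),
    'mvt': [10, -1, 2, 5],
})

out = merge_stops(demo)
print(out)
```

With this input the two visits to station 1000 stay on separate rows, and only the two consecutive rows at station 1001 are merged.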

Create the DataFrame:

import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])

Compute a rank to tell (train, station) pairs with one movement apart from those with two. Then use the rank to reshape the DataFrame:

df['rank'] = df.groupby(['train', 'station'])['time'].rank().astype(int)

# re-shape the data frame - 'rank' is part of column label
x = (df.set_index(['train', 'station', 'rank'])
       .unstack(level='rank')
       .reset_index())

# find rows with a time with rank=2 ...
mask = x.loc[:, ('time', 2)].notna()

# ... and replace time-1 with time-2 (keep later time only)
x.loc[mask, ('time', 1)] = x.loc[mask, ('time', 2)]

# drop time-2
x = x.drop(columns=('time', 2))

# re-name columns
x.columns = ['train', 'station', 'time', 'mvt_x', 'mvt_y']

print(x)

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
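One detail of the rank step may be worth checking on real data. The question states that the second timestamp is always later than the first, so the following case should not arise there; the `ties` frame is a hypothetical example. If two events of the same train at the same station did share a timestamp, the default `rank()` (method `'average'`) would give both rows 1.5, and `astype(int)` would turn both into 1. Passing `method='first'` is one way to keep the ranks distinct:

```python
import pandas as pd

# Hypothetical rows: two events of one train at one station with the same timestamp.
ties = pd.DataFrame({
    'train': [1, 1],
    'station': [1001, 1001],
    'time': pd.to_datetime(['2020-05-25 13:45:00', '2020-05-25 13:45:00']),
    'mvt': [-1, 2],
})

grouped = ties.groupby(['train', 'station'])['time']

# Default method='average': both rows get rank 1.5, which astype(int) truncates to 1.
avg_rank = grouped.rank().astype(int)

# method='first' breaks ties by order of appearance, giving ranks 1 and 2.
first_rank = grouped.rank(method='first').astype(int)

print(avg_rank.tolist(), first_rank.tolist())
```

Duplicate ranks within a (train, station) pair would make the later `unstack` fail because the index would no longer be unique, which is why distinct ranks matter here.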
