Get the first and last values of the chained sequences formed by two DataFrame columns

Posted 2024-04-25 08:14:37


I have a DataFrame with 3 columns: Replaced_ID, New_ID, and Installation Date (of New_ID).

Each New_ID replaces the Replaced_ID on the same row.

Replaced_ID      New_ID             Installation Date (of New_ID)
     3             5                    16/02/2018
     5             7                    17/05/2019
     7             9                    21/06/2019
     9             11                   23/08/2020
    25             39                   16/02/2017
    39             41                   16/08/2018

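My goal is to get a DataFrame that contains only the first and last record of each sequence. I only care about the first Replaced_ID value and the last New_ID value.

That is, from the DataFrame above I want this: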

    Replaced_ID      New_ID             Installation Date (of New_ID)
        3              11                    23/08/2020
        25             41                    16/08/2018

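As far as I can tell, sorting by date and applying a shift is not the solution here.

I also tried joining the New_ID column with Replaced_ID, but that does not work either, because it only returns the immediately preceding element of the sequence.

I need a way to get the sequences [3, 5, 7, 9, 11] and [25, 39, 41] by combining the Replaced_ID and New_ID columns across all rows.

What matters most to me is the first Replaced_ID value and the last New_ID value, not the Installation Date, because I can always recover that with a join at the end.

Any ideas? Thanks.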


2 Answers

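Assuming the rows are already sorted so that each chain is contiguous and in date order, you can create a helper series, then group by it and aggregate: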

df['Installation Date (of New_ID)']=pd.to_datetime(df['Installation Date (of New_ID)'], dayfirst=True)

s = df['Replaced_ID'].ne(df['New_ID'].shift()).cumsum()
out = df.groupby(s).agg(
      {"Replaced_ID":"first","New_ID":"last","Installation Date (of New_ID)":"last"}
     )

print(out)

   Replaced_ID  New_ID Installation Date (of New_ID)
1            3      11                    2020-08-23
2           25      41                    2018-08-16

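The helper series s separates the groups by comparing each Replaced_ID with the New_ID of the previous row (which is what shift() provides); where the two do not match, the comparison returns True. cumsum() then takes the running total of those True values, so every row in the same chain gets the same group label: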

print(s)

0    1
1    1
2    1
3    1
4    2
5    2
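To make the intermediate steps visible, here is a minimal, self-contained sketch of the same idea on the question's sample data. The shortened column name `date` and the variable names `prev_new` and `starts_chain` are only illustrative and do not appear in the answer above; the sketch also assumes, as the answer does, that the rows of each chain are adjacent and in order.

```python
import pandas as pd

# Sample data from the question; "date" is a shortened, illustrative column name.
df = pd.DataFrame({
    "Replaced_ID": [3, 5, 7, 9, 25, 39],
    "New_ID": [5, 7, 9, 11, 39, 41],
    "date": ["16/02/2018", "17/05/2019", "21/06/2019",
             "23/08/2020", "16/02/2017", "16/08/2018"],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)

# shift() moves New_ID down one row, so each row sees the previous row's New_ID.
prev_new = df["New_ID"].shift()

# True where a row does NOT continue the previous row's chain.
# Row 0 is always True because its shifted value is NaN.
starts_chain = df["Replaced_ID"].ne(prev_new)

# The running total of the True flags labels the chains 1, 2, ...
s = starts_chain.cumsum()

out = df.groupby(s).agg({"Replaced_ID": "first", "New_ID": "last", "date": "last"})
print(out)
```

If the rows of a chain are not adjacent, every break in adjacency starts a new label, so this approach would split one chain into several groups.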

First, let's create the DataFrame:

import pandas as pd
import numpy as np
from io import StringIO

data = """Replaced_ID,New_ID,Installation Date (of New_ID)
3,5,16/02/2018
5,7,17/05/2019
7,9,21/06/2019
9,11,23/08/2020
25,39,16/02/2017
39,41,16/08/2018
11,14,23/09/2020
41,42,23/10/2020
"""
### note that I've added two rows to check whether it works with non-consecutive rows

### defining some short hands
r = "Replaced_ID"
n = "New_ID"
i = "Installation Date (of New_ID)"

df = pd.read_csv(StringIO(data), header=0, sep=",")
### the dates are day-first (dd/mm/yyyy), so say so explicitly
df[i] = pd.to_datetime(df[i], dayfirst=True)

Now for my actual solution:

a = df[[r,n]].values.flatten()
### returns a flat list of r and n values which clearly show duplicate entries, i.e.:
#  [ 3  5  5  7  7  9  9 11 25 39 39 41 11 14 41 42]

### now only get values that occur once, 
#   and reshape them nicely, such that the first column gives the lowest (replaced) id,
#   and the second column gives the highest (new) id, i.e.:
#    [[ 3 14]
#     [25 42]]
u, c = np.unique( a, return_counts=True)
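### reshape(-1, 2) gives one [first, last] row per chain, however many chains there are
#   (this relies on the IDs increasing along each chain and on the chains not overlapping in ID range)
res = u[c == 1].reshape(-1, 2)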

### now filter the dataframe where "New_ID" is equal to the second column of res, i.e. [14,42]:
#   and replace the entries in "r" with the "lowest possible values" of r
dfn = df[df[n].isin(res[:, 1])].copy()
# print(dfn)
### assign on the copy directly; chained indexing (dfn.loc[:][r] = ...) may not write back
dfn[r] = res[:, 0]
print(dfn)

This yields:

   Replaced_ID  New_ID Installation Date (of New_ID)
6            3      14                    2020-09-23
7           25      42                    2020-10-23
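Both answers depend on some ordering in the data: the first on the rows of a chain being adjacent, the second on the sorted unique IDs pairing up as [first, last]. As a hedged alternative that is not part of either answer, the sketch below follows each chain explicitly through a dictionary. The helper `chain_endpoints` and the shortened column name `date` are hypothetical, and the sketch assumes each ID is replaced at most once and that no chain loops back on itself.

```python
import pandas as pd

# The extended sample data from the answer above; "date" is an illustrative column name.
df = pd.DataFrame({
    "Replaced_ID": [3, 5, 7, 9, 25, 39, 11, 41],
    "New_ID": [5, 7, 9, 11, 39, 41, 14, 42],
    "date": ["16/02/2018", "17/05/2019", "21/06/2019", "23/08/2020",
             "16/02/2017", "16/08/2018", "23/09/2020", "23/10/2020"],
})
df["date"] = pd.to_datetime(df["date"], dayfirst=True)


def chain_endpoints(frame, old="Replaced_ID", new="New_ID"):
    """Hypothetical helper: one row per chain with its first old ID and last new ID.

    Assumes every ID is replaced at most once and that there are no cycles.
    """
    successor = dict(zip(frame[old], frame[new]))
    new_ids = set(frame[new])
    # A chain starts at an old ID that never appears as anyone's replacement.
    starts = [x for x in frame[old] if x not in new_ids]

    rows = []
    for start in starts:
        last = start
        # Walk the chain until an ID has no further replacement.
        while last in successor:
            last = successor[last]
        rows.append({old: start, new: last})

    ends = pd.DataFrame(rows)
    # Attach the remaining columns (here the date) of the row that installed the final ID.
    return ends.merge(frame.drop(columns=old), on=new, how="left")


result = chain_endpoints(df)
print(result)
```

On this data the helper pairs 3 with 14 and 25 with 42, the same endpoints as the output above, without relying on row order or on the numeric order of the IDs.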
