Using a list comprehension to create 2 new columns by iterating over each row of several other columns

Posted 2024-06-16 11:49:09

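I have a dataframe named df that looks similar to this (except that the 'Date' columns go up to Date_8 and there are several hundred clients; I have simplified it here):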

Client_ID  Date_1        Date_2        Date_3        Date_4
C1019876   relationship  no change     no change     no change
C1018765   no change     single        no change     no change    
C1017654   single        no change     relationship  NaN        
C1016543   NaN           relationship  no change     single
C1015432   NaN           no change     single        NaN

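I would like to create two new columns, first_status and last_status. first_status should equal the first relationship status given across the 4 date columns, i.e. the first response that is either relationship or single, while last_status should equal the last relationship status given across the 4 date columns. The resulting df should look like this: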

Client_ID  Date_1        Date_2        Date_3        Date_4        first_status  last_status
C1019876   relationship  no change     no change     no change     relationship  relationship 
C1018765   no change     single        no change     no change     single        single    
C1017654   single        no change     relationship  NaN           single        relationship   
C1016543   NaN           relationship  no change     single        relationship  single 
C1015432   NaN           no change     single        NaN           single        single

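I think these two columns could be created with list comprehensions, but I don't know how. For the first_status column, I imagine the code would do the following for each row in df: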

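  • Find the first Date column that has a value (filtering out NaN)
  • If the value = no change, move on to the next Date column
  • If the value = relationship, first_status = relationship
  • If the value = single, first_status = single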

For the last_status column, I imagine the code would do the following for each row in df:

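  • Find the last Date column that has a value (filtering out NaN)
  • If the value = no change, move back to the previous Date column
  • If the value = relationship, last_status = relationship
  • If the value = single, last_status = single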
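
The two procedures described above can be sketched with plain list comprehensions. This is only a minimal sketch: the sample frame is copied from the question, while `STATUSES`, `date_cols` and the helper `first_match` are hypothetical names introduced for illustration, and it assumes every date column is named with a `Date_` prefix.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1015432"],
    "Date_1": ["relationship", "no change", "single", np.nan, np.nan],
    "Date_2": ["no change", "single", "no change", "relationship", "no change"],
    "Date_3": ["no change", "no change", "relationship", "no change", "single"],
    "Date_4": ["no change", "no change", np.nan, "single", np.nan],
})

STATUSES = {"relationship", "single"}  # values that count as a given status
# Assumption: all date columns share the "Date_" prefix.
date_cols = [c for c in df.columns if c.startswith("Date_")]


def first_match(values):
    """Return the first value that is a status, or NaN if there is none."""
    # NaN and 'no change' are skipped because neither is in STATUSES.
    return next((v for v in values if v in STATUSES), np.nan)


rows = df[date_cols].to_numpy().tolist()
df["first_status"] = [first_match(row) for row in rows]
# Scanning the same row right to left gives the last status.
df["last_status"] = [first_match(reversed(row)) for row in rows]
```

Because the scan stops at the first match, a row that contains no status at all ends up with NaN in both new columns rather than raising an error.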

2 answers

I suppose you can use a list comprehension if you really want to, but @yatu's solution will be faster:

import numpy as np

# unstack and find the first column index where relationship or single occurs
first = df.unstack().groupby(level=1).apply(lambda x: (np.isin(x.values, ['relationship', 'single'])).argmax())
last = df.unstack()[::-1].groupby(level=1).apply(lambda x: (np.isin(x.values, ['relationship', 'single'])).argmax())

# list comprehension to find the index and column index pair
f_list = [x for x in enumerate(first)]
l_list = [x for x in enumerate(last)]

# list comprehension with iloc
f_val = [df.iloc[f_list[i]] for i in range(len(f_list))]
l_val = [df.loc[:, ::-1].iloc[l_list[i]] for i in range(len(l_list))]

# create columns
df['first'] = f_val 
df['last'] = l_val

  Client_ID        Date_1        Date_2        Date_3     Date_4  \
0  C1019876  relationship     no change     no change  no change   
1  C1018765     no change        single     no change  no change   
2  C1017654        single     no change  relationship        NaN   
3  C1016543           NaN  relationship     no change     single   
4  C1015432           NaN     no change        single        NaN   

          first          last  
0  relationship  relationship  
1        single        single  
2        single  relationship  
3  relationship        single  
4        single        single  

Timing: 8 ms ± 230 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)
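
For comparison, the same argmax idea can probably be applied to a whole boolean mask at once instead of going through unstack and groupby. The following is a rough sketch rather than a benchmarked alternative: it assumes the date columns all start with `Date_`, and the names `mask`, `first_pos`, `last_pos` and `has_status` are made up for this example.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1015432"],
    "Date_1": ["relationship", "no change", "single", np.nan, np.nan],
    "Date_2": ["no change", "single", "no change", "relationship", "no change"],
    "Date_3": ["no change", "no change", "relationship", "no change", "single"],
    "Date_4": ["no change", "no change", np.nan, "single", np.nan],
})

# Assumption: all date columns share the "Date_" prefix.
date_cols = [c for c in df.columns if c.startswith("Date_")]
values = df[date_cols].to_numpy(dtype=object)

# True wherever a cell holds an actual status.
mask = df[date_cols].isin(["relationship", "single"]).to_numpy()

# argmax returns the position of the first True in each row.
first_pos = mask.argmax(axis=1)
# Reverse the columns to find the last True, then map back to the original position.
last_pos = mask.shape[1] - 1 - mask[:, ::-1].argmax(axis=1)

# argmax returns 0 for an all-False row, so guard those rows explicitly.
has_status = mask.any(axis=1)
row_idx = np.arange(len(df))
df["first_status"] = np.where(has_status, values[row_idx, first_pos], np.nan)
df["last_status"] = np.where(has_status, values[row_idx, last_pos], np.nan)
```

Restricting the mask to the date columns keeps Client_ID out of the search, and the `has_status` guard covers rows where no status was ever given.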

You can use replace to turn no change into np.nan, then use bfill and ffill to select the first and last valid values, respectively:

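import numpy as np

df = df.replace('no change', np.nan)
# back-fill along each row, so Date_1 ends up holding the first valid value
df['first_status'] = df.bfill(axis=1).Date_1
# slice up to Date_4 so first_status is excluded, then forward-fill along each row
df['last_status'] = df.loc[:,:'Date_4'].ffill(axis=1).Date_4
#df = df.fillna('no change') # if needed; note this also fills cells that were NaN originally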

 Client_ID        Date_1        Date_2        Date_3  Date_4  first_status  \
0  C1019876  relationship           NaN           NaN     NaN  relationship   
1  C1018765           NaN        single           NaN     NaN        single   
2  C1017654        single           NaN  relationship     NaN        single   
3  C1016543           NaN  relationship           NaN  single  relationship   
4  C1015432           NaN           NaN        single     NaN        single   

    last_status  
0  relationship  
1        single  
2  relationship  
3        single  
4        single  

If the Date columns run up to n, use df.loc[:,:'Date_n'].ffill(axis=1).Date_n for last_status.
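
If the number of Date columns is not known in advance, one possible variation is to select them by name and mask everything that is not a status before filling. This is a sketch under assumptions, not part of the answer above: it assumes the date columns all contain `Date_` in their name, and `date_cols` and `statuses` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1015432"],
    "Date_1": ["relationship", "no change", "single", np.nan, np.nan],
    "Date_2": ["no change", "single", "no change", "relationship", "no change"],
    "Date_3": ["no change", "no change", "relationship", "no change", "single"],
    "Date_4": ["no change", "no change", np.nan, "single", np.nan],
})

# Assumption: every date column has "Date_" in its name, in chronological order.
date_cols = df.filter(like="Date_").columns

# Keep only real statuses; 'no change' and NaN both become NaN in this copy.
statuses = df[date_cols].where(df[date_cols].isin(["relationship", "single"]))

# After a back-fill along the row, the first column holds the first status.
df["first_status"] = statuses.bfill(axis=1).iloc[:, 0]
# After a forward-fill along the row, the last column holds the last status.
df["last_status"] = statuses.ffill(axis=1).iloc[:, -1]
```

Working on a masked copy leaves the original 'no change' cells in df untouched, and because Client_ID is not part of the filled frame, a row with no status at all gets NaN rather than its client ID.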
