Conditional shift: subtract the previous row's value from the current row's value, subject to multiple conditions

Posted 2024-05-19 00:43:40


I have the following DataFrame:

Disease     HeartRate   State    MonthStart   MonthEnd    
Covid       89          Texas    2020-02-28   2020-03-31      
Covid       91          Texas    2020-03-31   2020-04-30     
Covid       87          Texas    2020-07-31   2020-08-30      
Cancer      90          Texas    2020-02-28   2020-03-31 
Cancer      88          Florida  2020-03-31   2020-04-30      
Covid       89          Florida  2020-02-28   2020-03-31      
Covid       87          Florida  2020-03-31   2020-04-30      
Flu         90          Florida  2020-02-28   2020-03-31        
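For anyone who wants to reproduce the problem, here is a minimal sketch that rebuilds the sample frame above. It assumes the dates are ISO-formatted strings and converts them with pd.to_datetime; the variable name df is just a convention.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "Disease": ["Covid", "Covid", "Covid", "Cancer", "Cancer", "Covid", "Covid", "Flu"],
    "HeartRate": [89, 91, 87, 90, 88, 89, 87, 90],
    "State": ["Texas", "Texas", "Texas", "Texas", "Florida", "Florida", "Florida", "Florida"],
    "MonthStart": ["2020-02-28", "2020-03-31", "2020-07-31", "2020-02-28",
                   "2020-03-31", "2020-02-28", "2020-03-31", "2020-02-28"],
    "MonthEnd": ["2020-03-31", "2020-04-30", "2020-08-30", "2020-03-31",
                 "2020-04-30", "2020-03-31", "2020-04-30", "2020-03-31"],
})

# Parse the date strings so that .dt accessors and date arithmetic work.
df["MonthStart"] = pd.to_datetime(df["MonthStart"])
df["MonthEnd"] = pd.to_datetime(df["MonthEnd"])
print(df)
```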

I need to subtract the previous row from the current row in the HeartRate column and store the result in a new column.

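However, there are some conditions:

  1. Rows are subtracted only when the Disease and State columns have the same values.
  2. Rows are subtracted only when they fall in consecutive months. If there is a break in the timeline, no subtraction is performed.
  3. If there is no previous row to subtract, the HeartRate value is used as-is.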

Expected output:

Disease     HeartRate   State    MonthStart   MonthEnd     HeartRateDiff
Covid       89          Texas    2020-02-28   2020-03-31    89      
Covid       91          Texas    2020-03-31   2020-04-30    2     
Covid       87          Texas    2020-07-31   2020-08-30    87      
Cancer      90          Texas    2020-02-28   2020-03-31    90 
Cancer      88          Florida  2020-03-31   2020-04-30    88          
Covid       89          Florida  2020-02-28   2020-03-31    89      
Covid       87          Florida  2020-03-31   2020-04-30    -2      
Flu         90          Florida  2020-02-28   2020-03-31    90      

I know how to subtract the previous row from the current row with the following code:

df['DiffHeartRate'] = df.groupby(['Disease', 'State'])['HeartRate'].transform(pd.Series.diff)

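However, I am stuck on two points:

  1. Keeping the original value when there is no previous row to subtract
  2. Checking the continuity of the timeline (whether the row falls in the next month)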
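To make the two points concrete, here is a small sketch of what the plain grouped diff returns on the sample data. Only the three columns it needs are rebuilt here; the name plain is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Disease": ["Covid", "Covid", "Covid", "Cancer", "Cancer", "Covid", "Covid", "Flu"],
    "State": ["Texas", "Texas", "Texas", "Texas", "Florida", "Florida", "Florida", "Florida"],
    "HeartRate": [89, 91, 87, 90, 88, 89, 87, 90],
})

# Plain diff within each (Disease, State) group.
plain = df.groupby(["Disease", "State"])["HeartRate"].diff()
print(plain.tolist())
# [nan, 2.0, -4.0, nan, nan, nan, -2.0, nan]
```

The first row of every group comes back as NaN instead of the HeartRate value, and row 2 (July, after a gap) is still diffed against April and gives -4 instead of 87.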

Is there a smarter way to do this? Any help would be appreciated. Thanks.


3 Answers

Try:

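import numpy as np
import pandas as pd

df.MonthStart = pd.to_datetime(df.MonthStart)
df.MonthEnd = pd.to_datetime(df.MonthEnd)


def cal_diff(x):
    # Within one (Disease, State) group: take the diff only when the previous
    # row's MonthEnd falls in the same calendar month as this row's MonthStart;
    # otherwise keep the HeartRate value itself.
    x['DiffHeartRate'] = np.where(x['MonthEnd'].shift().dt.month.eq(
        x['MonthStart'].dt.month), x['HeartRate'].diff(), x['HeartRate'])
    return x


# group_keys=False keeps the original row index instead of prepending the group keys
df = df.groupby(['Disease', 'State'], group_keys=False).apply(cal_diff)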

Output:

  Disease  HeartRate    State MonthStart   MonthEnd DiffHeartRate
0   Covid         89    Texas 2020-02-28 2020-03-31            89
1   Covid         91    Texas 2020-03-31 2020-04-30             2
2   Covid         87    Texas 2020-07-31 2020-08-30            87
3  Cancer         90    Texas 2020-02-28 2020-03-31            90
4  Cancer         88  Florida 2020-03-31 2020-04-30            88
5   Covid         89  Florida 2020-02-28 2020-03-31            89
6   Covid         87  Florida 2020-03-31 2020-04-30            -2
7     Flu         90  Florida 2020-02-28 2020-03-31            90

You can do this with .mask() together with .groupby().transform(), as follows:

df['HeartRateDiff'] = (df['HeartRate'].mask(
                           df['MonthStart'].groupby([df['Disease'], df['State']]).transform('diff').lt(np.timedelta64(2,'M')),
                           df.groupby(['Disease', 'State'])['HeartRate'].transform('diff')
                           )
                      )

Details:

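(1) First, we make sure the date columns are in datetime format rather than strings:

You can skip this step if the date columns are already in datetime format.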

df['MonthStart'] = pd.to_datetime(df['MonthStart'])
df['MonthEnd'] = pd.to_datetime(df['MonthEnd'])

(2) The change in heart rate (within each group) is obtained with:

df.groupby(['Disease', 'State'])['HeartRate'].transform('diff')

We can simply pass 'diff' to .transform() instead of pd.Series.diff to get the same result.
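A quick sketch to check that claim on a tiny frame (three made-up rows in the same shape as the question's data; by_name and by_func are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "Disease": ["Covid", "Covid", "Flu"],
    "State": ["Texas", "Texas", "Florida"],
    "HeartRate": [89, 91, 90],
})

grouped = df.groupby(["Disease", "State"])["HeartRate"]
by_name = grouped.transform("diff")          # string alias
by_func = grouped.transform(pd.Series.diff)  # explicit function

# Series.equals treats NaN values in the same position as equal.
print(by_name.equals(by_func))
```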

(3) The continuity of the timeline (next month or not) is checked with the following condition:

df['MonthStart'].groupby([df['Disease'], df['State']]).transform('diff').lt(np.timedelta64(2,'M'))

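We check that the time difference from the previous date (within the group) is strictly less than 2 months, which ensures the row falls in the next month. We cannot check for <= 1 month, because the difference between dates in two consecutive months can be as much as 32 days (for example, 2020-02-28 to 2020-03-31). Note that this check also works across a year boundary (December to January), where a logical check on the month numbers alone (12 to 1) would give the wrong result.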
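A month is not a fixed duration, and some pandas versions may refuse a comparison against np.timedelta64(2, 'M'). As an alternative sketch (not the method used in this answer), the same continuity test can be written on calendar month numbers. It assumes "consecutive" means that the calendar month of MonthStart is exactly one later than the previous row's within the group. The sample dates and the names month_idx, step and is_consecutive are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Disease": ["Covid"] * 4,
    "State": ["Texas"] * 4,
    "MonthStart": pd.to_datetime(["2020-11-30", "2020-12-31", "2021-01-31", "2021-05-31"]),
    "HeartRate": [89, 91, 87, 90],
})

# Absolute month number, so December -> January is a step of exactly 1.
month_idx = df["MonthStart"].dt.year * 12 + df["MonthStart"].dt.month

# Step from the previous row within each (Disease, State) group; NaN on first rows.
step = month_idx.groupby([df["Disease"], df["State"]]).diff()
is_consecutive = step.eq(1)
print(is_consecutive.tolist())  # [False, True, True, False]

# Replace HeartRate by the grouped diff only where the months are consecutive.
df["HeartRateDiff"] = df["HeartRate"].mask(
    is_consecutive,
    df.groupby(["Disease", "State"])["HeartRate"].diff(),
)
print(df["HeartRateDiff"].tolist())  # [89.0, 2.0, -4.0, 90.0]
```

The December to January step counts as consecutive, while the jump from January to May does not.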

(4) Finally, we call .mask() on the existing HeartRate column to get the new column:

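.mask() tests the condition given as its first argument and, where the condition is true, replaces the row with the value from its second argument. Where the condition is not met, it keeps the row's original value. This is how we achieve the conditional replacement we want.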
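A minimal sketch of that behaviour on a toy Series (the values and the names s, cond and other are made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30])
cond = pd.Series([False, True, False])
other = pd.Series([-1, -2, -3])

# Where cond is True, take the value from `other`; elsewhere keep `s`.
out = s.mask(cond, other)
print(out.tolist())  # [10, -2, 30]
```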

Output:

  Disease  HeartRate    State MonthStart   MonthEnd  HeartRateDiff
0   Covid         89    Texas 2020-02-28 2020-03-31             89
1   Covid         91    Texas 2020-03-31 2020-04-30              2
2   Covid         87    Texas 2020-07-31 2020-08-30             87
3  Cancer         90    Texas 2020-02-28 2020-03-31             90
4  Cancer         88  Florida 2020-03-31 2020-04-30             88
5   Covid         89  Florida 2020-02-28 2020-03-31             89
6   Covid         87  Florida 2020-03-31 2020-04-30             -2
7     Flu         90  Florida 2020-02-28 2020-03-31             90

You can try the following:

# The third grouping key starts a new block whenever the month is not "previous month + 1"
df['DiffHeartRate'] = (df.groupby(['Disease', 'State',
          (df.MonthStart.dt.month.ne(df.MonthStart.dt.month.shift() + 1)).cumsum()])['HeartRate']
 .diff().fillna(df.HeartRate))

    Disease HeartRate   State   MonthStart  MonthEnd    DiffHeartRate
0   Covid   89          Texas   2020-02-28  2020-03-31  89.0
1   Covid   91          Texas   2020-03-31  2020-04-30  2.0
2   Covid   87          Texas   2020-07-31  2020-08-30  87.0
3   Cancer  90          Texas   2020-02-28  2020-03-31  90.0
4   Cancer  88          Florida 2020-03-31  2020-04-30  88.0
5   Covid   89          Florida 2020-02-28  2020-03-31  89.0
6   Covid   87          Florida 2020-03-31  2020-04-30  -2.0
7   Flu     90          Florida 2020-02-28  2020-03-31  90.0

The logic is the same as in the other answers, just expressed differently.
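To see what the extra grouping key does, here is a sketch that computes it on the month numbers of the sample data alone (starts_new_block and block_id are illustrative names):

```python
import pandas as pd

# Month of MonthStart for the eight sample rows, in their original order.
months = pd.Series([2, 3, 7, 2, 3, 2, 3, 2])

# True wherever the month is NOT exactly one more than the previous row's month.
starts_new_block = months.ne(months.shift() + 1)

# The running count of block starts gives each run of consecutive months its own id.
block_id = starts_new_block.cumsum()
print(block_id.tolist())  # [1, 1, 2, 3, 3, 4, 4, 5]
```

Rows that share Disease, State and block id form one consecutive run, so the diff inside each run is NaN only on its first row, and fillna then puts the HeartRate value back there.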
