大Pandas两栏的年际差异

2024-03-29 16:03:12 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一张桌子如下。第一栏是年份,第二栏是路面处理类型,第三栏是路面得分。我需要创建一个名为'year diff'的第三列,从当前得分的年份中减去最后一次治疗的年份。例如,2014年需要减去2013年,因为治疗9是在2013年完成的,结果1需要记录在相应单元格的col['year diff']中。2022年需要减去2020年,因为治疗10是在2020年完成的。你知道吗

enter image description here

非常感谢大家的帮助。你知道吗

真诚的

威尔逊


Tags: 类型记录diffcolyear年份桌子路面
3条回答

IIUC,你可以用:

df['identifier']=(df['year'].diff().eq(1)&df['treatment'].notnull()).cumsum()
df['year diff ']=df.groupby('identifier')['identifier'].apply\
(lambda x: pd.Series(np.where(x!=0,pd.Series(pd.factorize(x)[0]+1).cumsum().shift(),np.nan))).values
print(df)

或者如果你需要考虑基于治疗价值的得分差异:

df['identifier']=(df['year'].diff().eq(1) &df['treatment'].notnull()).cumsum()
df['year diff']=df.groupby('identifier')['score']\
.apply(lambda x : pd.Series(np.where(x!=0,x.diff().expanding().sum(),np.nan))).reset_index(drop=True)
df.loc[df['identifier']==0,'year diff']=np.nan
print(df)

    year  treatment  score  identifier  year diff 
0   2010        NaN      1           0         NaN
1   2011        NaN      2           0         NaN
2   2012        NaN      3           0         NaN
3   2013        9.0      4           1         NaN
4   2014        NaN      5           1         1.0
5   2015        NaN      6           1         2.0
6   2016        NaN      7           1         3.0
7   2017        NaN      8           1         4.0
8   2018        NaN      9           1         5.0
9   2019        NaN     10           1         6.0
10  2020       10.0     11           2         NaN
11  2021        NaN     12           2         1.0
12  2022        NaN     13           2         2.0
13  2023        NaN     14           2         3.0
14  2024        NaN     15           2         4.0
15  2025       12.0     16           3         NaN
16  2026        NaN     17           3         1.0
17  2027        NaN     18           3         2.0

用途:

#check not missing values
m = df['treatment'].notnull()
#create groups starting not missing values
s = m.cumsum()
#add missing values for first group and for not missing values
mask = (s == 0) | m

#subtract score with first score per group 
out =  df['score'] - df['score'].groupby(s).transform('first')
#add missing values
df['year diff'] = np.where(mask, np.nan, out)
print (df)
    year  treatment  score  year diff
0   2010        NaN      1        NaN
1   2011        NaN      2        NaN
2   2012        NaN      3        NaN
3   2013        9.0      4        NaN
4   2014        NaN      5        1.0
5   2015        NaN      6        2.0
6   2016        NaN      7        3.0
7   2017        NaN      8        4.0
8   2018        NaN      9        5.0
9   2019        NaN     10        6.0
10  2020       10.0     11        NaN
11  2021        NaN     12        1.0
12  2022        NaN     13        2.0
13  2023        NaN     14        3.0
14  2024        NaN     15        4.0
15  2025       12.0     16        NaN
16  2026        NaN     17        1.0
17  2027        NaN     18        2.0

如果要使用for循环:

df = pd.DataFrame(mydata) 
mylist = df.index[df['treatment'] != ''].tolist()

现在我们减去year

re_list= []
for index,row in df.iterrows():
    if index > min(mylist):
        m = [i for i in mylist if i <= index]
        re_list.append(df.iloc[index]['year'] - df.iloc[max(m)]['year'])
else:
    re_list.append(0)

df['Result'] = re_list

相关问题 更多 >