How do I concatenate column values within DataFrame groups?

Posted 2024-06-16 08:52:55


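Question

I want to group DataFrame entries into two-year intervals, join the column values with the separator "#", and join entries that fall in the same interval with the separator ";".

I previously did this by iterating through the years and creating a new DataFrame, but it is rather messy; I would prefer a vectorized solution.

Example input: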

  dx_code patient_id                           dx_name  year
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012
1   H53.9    Z324563                Visual disturbance  2014
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4    None    Z273652    Disorder of bone and cartilage  2004
5   272.0    Z273652         Pure hypercholesterolemia  2006
6  729.81    Z273652                  Swelling of limb  2012
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011

Example output:

  patient_id                         2004–2005_dx  \
0    Z324563                                 None   
1    Z273652  None#Disorder of bone and cartilage   

                      2006–2007_dx                          2008–2009_dx  \
0                             None  725#Polymyalgia rheumatica (CMS/HCC)   
1  272.0#Pure hypercholesterolemia                                  None   

                                                                 2010–2011_dx  \
0                                        725#Polymyalgia rheumatica (CMS/HCC)   
1  446.5#Giant cell arteritis (CMS/HCC); 725#Polymyalgia rheumatica (CMS/HCC)   

                           2012–2013_dx                   2014_dx  \
0  427.31#Atrial fibrillation (CMS/HCC)  H53.9#Visual disturbance   
1               729.81#Swelling of limb                      None   

  unknown_time_dx  
0            None  
1            None  

What I've tried

Following this answer, I have the following code:

(self.data.groupby(["patient_id", pd.Grouper(freq="2Y", key="date")])
          .sum()
          .unstack(fill_value=""))

It produces the following output:

              dx_code                                                                     dx_name                                                                                                                                    
date       2004-12-31 2006-12-31 2010-12-31 2012-12-31 2014-12-31                      2004-12-31                 2006-12-31                        2010-12-31                                         2012-12-31          2014-12-31
patient_id                                                                                                                                                                                                                           
Z273652             0      272.0      446.5  729.81725             Disorder of bone and cartilage  Pure hypercholesterolemia    Giant cell arteritis (CMS/HCC)   Swelling of limbPolymyalgia rheumatica (CMS/HCC)                    
Z324563                                 725  427.31725      H53.9                                                             Polymyalgia rheumatica (CMS/HCC)  Atrial fibrillation (CMS/HCC)Polymyalgia rheum...  Visual disturbance

However, I can't figure out how to combine the column values of these two groups.


1 answer
User
Answer #1 · posted 2024-06-16 08:52:55

OK, let's create the starting DataFrame:

content = """  dx_code  patient_id  dx_name  year
0  427.31  Z324563  Atrial fibrillation (CMS/HCC)  2012
1  H53.9  Z324563  Visual disturbance  2014
2  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4  None  Z273652  Disorder of bone and cartilage  2004
5  272.0  Z273652  Pure hypercholesterolemia  2006
6  729.81  Z273652  Swelling of limb  2012
7  446.5  Z273652  Giant cell arteritis (CMS/HCC)  2010
8  725  Z273652  Polymyalgia rheumatica (CMS/HCC)  2011
"""
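from io import StringIO
import pandas as pd

# A multi-character separator needs the Python parsing engine.
# keep_default_na=False keeps the literal text "None" as a string
# (newer pandas versions may otherwise parse it as NaN).
df = pd.read_csv(StringIO(content),
                 sep='  ',
                 engine='python',
                 keep_default_na=False)
print(df)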

  dx_code patient_id                           dx_name  year
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012
1   H53.9    Z324563                Visual disturbance  2014
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4    None    Z273652    Disorder of bone and cartilage  2004
5   272.0    Z273652         Pure hypercholesterolemia  2006
6  729.81    Z273652                  Swelling of limb  2012
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011

Now, define the bins:

import numpy as np
# b = [0, 2004, 2006, 2008, 2010, 2012, np.inf]  # you can write the list by hand if you wish (I suggest starting with 0 and finishing with np.inf)
b = list(range(2002, 2020, 2))  # or generate a wider range of edges: 2002, 2004, ..., 2018

Then:

df_cut = df.assign(PopGroup=pd.cut(df.year,bins=b))
print(df_cut)
  dx_code patient_id                           dx_name  year      PopGroup
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012  (2010, 2012]
1   H53.9    Z324563                Visual disturbance  2014  (2012, 2014]
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009  (2008, 2010]
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011  (2010, 2012]
4    None    Z273652    Disorder of bone and cartilage  2004  (2002, 2004]
5   272.0    Z273652         Pure hypercholesterolemia  2006  (2004, 2006]
6  729.81    Z273652                  Swelling of limb  2012  (2010, 2012]
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010  (2008, 2010]
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011  (2010, 2012]
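
Note that `pd.cut` builds right-closed bins by default, so with the edges above 2004 lands in (2002, 2004] while 2005 lands in (2004, 2006]. The question asks for windows such as 2004 to 2005, so one possible adjustment is sketched below. It assumes left-closed bins via `right=False`; the edge list and the label names are illustrative choices, not part of the original answer.

```python
import pandas as pd

years = pd.Series([2004, 2005, 2006, 2009, 2010, 2011, 2012, 2014], name="year")

# Left edges of each two-year window: 2004, 2006, ..., 2016 (assumed range).
edges = list(range(2004, 2018, 2))
# One label per window, e.g. "2004-2005_dx" (hypothetical naming).
labels = [f"{start}-{start + 1}_dx" for start in edges[:-1]]

# right=False makes each bin [start, start + 2), so 2004 and 2005 share a bin.
period = pd.cut(years, bins=edges, right=False, labels=labels)
print(pd.DataFrame({"year": years, "period": period}))
```

Years outside the edge list come back as NaN, which could be mapped to a separate "unknown" column if needed.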

Let's join the dx_code and dx_name columns:

df_cut['DX_code_name'] = df_cut[['dx_code', 'dx_name']].agg('#'.join, axis=1)
print(df_cut)
  dx_code patient_id  ...      PopGroup                          DX_code_name
0  427.31    Z324563  ...  (2010, 2012]  427.31#Atrial fibrillation (CMS/HCC)
1   H53.9    Z324563  ...  (2012, 2014]              H53.9#Visual disturbance
2     725    Z324563  ...  (2008, 2010]  725#Polymyalgia rheumatica (CMS/HCC)
3     725    Z324563  ...  (2010, 2012]  725#Polymyalgia rheumatica (CMS/HCC)
4    None    Z273652  ...  (2002, 2004]   None#Disorder of bone and cartilage
5   272.0    Z273652  ...  (2004, 2006]       272.0#Pure hypercholesterolemia
6  729.81    Z273652  ...  (2010, 2012]               729.81#Swelling of limb
7   446.5    Z273652  ...  (2008, 2010]  446.5#Giant cell arteritis (CMS/HCC)
8     725    Z273652  ...  (2010, 2012]  725#Polymyalgia rheumatica (CMS/HCC)
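
The row-wise `'#'.join` works here because every cell is a string, including the literal text "None". If `dx_code` held real missing values instead, `str.join` would raise a `TypeError`. A minimal sketch of a more defensive variant, assuming a placeholder of "None" is acceptable for missing codes (the two-row frame below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "dx_code": [None, "272.0"],  # a real missing value, not the string "None"
    "dx_name": ["Disorder of bone and cartilage", "Pure hypercholesterolemia"],
})

# Fill missing codes with a placeholder, then concatenate element-wise.
code = df["dx_code"].fillna("None").astype(str)
df["DX_code_name"] = code + "#" + df["dx_name"].astype(str)
print(df["DX_code_name"].tolist())
```

Element-wise `+` on the two columns also avoids calling a Python function once per row.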

Finally, we use pivot_table:

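table = pd.pivot_table(df_cut,
                       values=['DX_code_name'],
                       index=['patient_id'],
                       columns=['year'],  # one column per year; 'PopGroup' would pivot on the bins instead
                       aggfunc=lambda x: '; '.join(x),  # ';' separates entries that share a cell
                       fill_value=np.nan
                       )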

Let's take a look:

table
DX_code_name
year        2004                                 2006                             2009
patient_id
Z273652     None#Disorder of bone and cartilage  272.0#Pure hypercholesterolemia  NaN
Z324563     NaN                                  NaN                              725#Polymyalgia rheumatica (CMS/HCC)

year        2010                                  2011
patient_id
Z273652     446.5#Giant cell arteritis (CMS/HCC)  725#Polymyalgia rheumatica (CMS/HCC)
Z324563     NaN                                   725#Polymyalgia rheumatica (CMS/HCC)

year        2012                                  2014
patient_id
Z273652     729.81#Swelling of limb               NaN
Z324563     427.31#Atrial fibrillation (CMS/HCC)  H53.9#Visual disturbance
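
The table above has one column per year rather than one per two-year window. To get closer to the layout requested in the question, the steps can be combined as in the following sketch. It is one possible arrangement, not the original answer's code: it assumes left-closed bins starting at 2004, hypothetical column labels such as "2010-2011_dx", and `groupby` plus `unstack` in place of `pivot_table`. It does not produce the `2014_dx` or `unknown_time_dx` columns from the example output.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["Z324563"] * 4 + ["Z273652"] * 5,
    "dx_code": ["427.31", "H53.9", "725", "725", "None",
                "272.0", "729.81", "446.5", "725"],
    "dx_name": [
        "Atrial fibrillation (CMS/HCC)", "Visual disturbance",
        "Polymyalgia rheumatica (CMS/HCC)", "Polymyalgia rheumatica (CMS/HCC)",
        "Disorder of bone and cartilage", "Pure hypercholesterolemia",
        "Swelling of limb", "Giant cell arteritis (CMS/HCC)",
        "Polymyalgia rheumatica (CMS/HCC)",
    ],
    "year": [2012, 2014, 2009, 2011, 2004, 2006, 2012, 2010, 2011],
})

edges = list(range(2004, 2018, 2))                 # assumed window starts
labels = [f"{s}-{s + 1}_dx" for s in edges[:-1]]   # hypothetical column names

out = (
    df.sort_values("year", kind="stable")  # entries within a window appear in year order
      .assign(
          period=lambda d: pd.cut(d["year"], bins=edges, right=False, labels=labels),
          dx=lambda d: d["dx_code"] + "#" + d["dx_name"],  # code#name per row
      )
      .groupby(["patient_id", "period"], observed=True)["dx"]
      .agg("; ".join)                      # join entries that share a window
      .unstack("period")                   # one column per window
)
print(out)
```

Cells for windows in which a patient has no entries are left as NaN.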
