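How do I concatenate column values within DataFrame groups?

Question

I want to group DataFrame entries into two-year intervals, join the column values with the separator "#", and join entries that fall in the same interval with the separator ";".

I previously did this by iterating through the years and creating a new DataFrame, but it was rather messy, so I would prefer a vectorized solution.

Example input: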

  dx_code patient_id                           dx_name  year
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012
1   H53.9    Z324563                Visual disturbance  2014
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4    None    Z273652    Disorder of bone and cartilage  2004
5   272.0    Z273652         Pure hypercholesterolemia  2006
6  729.81    Z273652                  Swelling of limb  2012
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011

Example output:

  patient_id                         2004–2005_dx  \
0    Z324563                                 None
1    Z273652  None#Disorder of bone and cartilage

                      2006–2007_dx                          2008–2009_dx  \
0                             None  725#Polymyalgia rheumatica (CMS/HCC)
1  272.0#Pure hypercholesterolemia                                  None

                                                                 2010–2011_dx  \
0                                        725#Polymyalgia rheumatica (CMS/HCC)
1  446.5#Giant cell arteritis (CMS/HCC); 725#Polymyalgia rheumatica (CMS/HCC)

                           2012–2013_dx                   2014_dx  \
0  427.31#Atrial fibrillation (CMS/HCC)  H53.9#Visual disturbance
1               729.81#Swelling of limb                      None

  unknown_time_dx
0            None
1            None

What I tried

Following this answer, I have the following code:

(self.data.groupby(["patient_id", pd.Grouper(freq="2Y", key="date")])
    .sum()
    .unstack(fill_value=""))

It produces the following output:

dx_code dx_name
date 2004-12-31 2006-12-31 2010-12-31 2012-12-31 2014-12-31 2004-12-31 2006-12-31 2010-12-31 2012-12-31 2014-12-31
patient_id
Z273652 0 272.0 446.5 729.81725 Disorder of bone and cartilage Pure hypercholesterolemia Giant cell arteritis (CMS/HCC) Swelling of limbPolymyalgia rheumatica (CMS/HCC)
Z324563 725 427.31725 H53.9 Polymyalgia rheumatica (CMS/HCC) Atrial fibrillation (CMS/HCC)Polymyalgia rheum... Visual disturbance

However, I can't figure out how to combine the column values within each of these groups.
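For context, `.sum()` on string columns concatenates the values with no separator, which is why `729.81` and `725` appear fused as `729.81725` in the output above. A minimal sketch of the difference, assuming a small hypothetical frame (the `period` column and its label are made up for illustration):

```python
import pandas as pd

# Hypothetical two-row sample whose rows fall in the same group.
df = pd.DataFrame(
    {
        "patient_id": ["Z273652", "Z273652"],
        "period": ["2012-2013", "2012-2013"],
        "dx_code": ["729.81", "725"],
    },
    dtype=object,
)

grouped = df.groupby(["patient_id", "period"])["dx_code"]

# sum() on strings concatenates them without any separator.
fused = grouped.sum()

# Passing str.join as the aggregation keeps the entries apart.
joined = grouped.agg("; ".join)

print(fused.iloc[0])
print(joined.iloc[0])
```

So replacing `sum` with a join on an explicit separator is one way to keep entries of the same group distinguishable.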

1 Answer

User

#1 · posted 2024-06-16 08:52:55

OK, let's create the starting DataFrame:

content = """  dx_code  patient_id  dx_name  year
0  427.31  Z324563  Atrial fibrillation (CMS/HCC)  2012
1  H53.9  Z324563  Visual disturbance  2014
2  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3  725  Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4  None  Z273652  Disorder of bone and cartilage  2004
5  272.0  Z273652  Pure hypercholesterolemia  2006
6  729.81  Z273652  Swelling of limb  2012
7  446.5  Z273652  Giant cell arteritis (CMS/HCC)  2010
8  725  Z273652  Polymyalgia rheumatica (CMS/HCC)  2011
"""
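from io import StringIO
import pandas as pd

# Columns are separated by two or more spaces; a regex separator needs the
# python engine. Each data row has one more field than the header, so pandas
# uses the first field as the index.
# keep_default_na=False keeps the literal string "None" instead of NaN.
df = pd.read_csv(StringIO(content),
                 sep=r'\s{2,}',
                 engine='python',
                 keep_default_na=False)
print(df)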

  dx_code patient_id                           dx_name  year
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012
1   H53.9    Z324563                Visual disturbance  2014
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011
4    None    Z273652    Disorder of bone and cartilage  2004
5   272.0    Z273652         Pure hypercholesterolemia  2006
6  729.81    Z273652                  Swelling of limb  2012
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011

Now, define the bins:

import numpy as np
# Option 1: list the edges by hand (starting at 0 and ending at np.inf catches out-of-range years)
# b = [0, 2004, 2006, 2008, 2010, 2012, np.inf]
# Option 2: generate evenly spaced edges over a wider range
b = list(range(2002, 2020, 2))

So:

df_cut = df.assign(PopGroup=pd.cut(df.year,bins=b))
print(df_cut)
  dx_code patient_id                           dx_name  year      PopGroup
0  427.31    Z324563     Atrial fibrillation (CMS/HCC)  2012  (2010, 2012]
1   H53.9    Z324563                Visual disturbance  2014  (2012, 2014]
2     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2009  (2008, 2010]
3     725    Z324563  Polymyalgia rheumatica (CMS/HCC)  2011  (2010, 2012]
4    None    Z273652    Disorder of bone and cartilage  2004  (2002, 2004]
5   272.0    Z273652         Pure hypercholesterolemia  2006  (2004, 2006]
6  729.81    Z273652                  Swelling of limb  2012  (2010, 2012]
7   446.5    Z273652    Giant cell arteritis (CMS/HCC)  2010  (2008, 2010]
8     725    Z273652  Polymyalgia rheumatica (CMS/HCC)  2011  (2010, 2012]
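With the default `right=True`, `pd.cut` builds right-closed intervals, so in the output above 2010 lands in `(2008, 2010]` while 2011 lands in `(2010, 2012]`. The question instead groups 2010 with 2011. A minimal sketch of one way to get left-closed windows with readable labels, assuming an upper edge of 2016 and hypothetical label text written with a plain hyphen:

```python
import pandas as pd

years = pd.Series([2004, 2005, 2010, 2011, 2014])

# Bin edges every two years; the 2016 upper bound is an assumption.
edges = list(range(2004, 2018, 2))  # 2004, 2006, ..., 2016

# Hypothetical labels in the question's "2004-2005_dx" style, one per interval.
labels = [f"{lo}-{lo + 1}_dx" for lo in edges[:-1]]

# right=False makes each interval closed on the left: [2004, 2006), [2006, 2008), ...
periods = pd.cut(years, bins=edges, right=False, labels=labels)
print(periods.tolist())
```

Years outside the edges come back as missing values, which could be mapped to a catch-all column such as the question's `unknown_time_dx`.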

Let's join the dx_code and dx_name columns:

df_cut['DX_code_name'] = df_cut[['dx_code', 'dx_name']].agg('#'.join, axis=1)
print(df_cut)
  dx_code patient_id  ...      PopGroup                          DX_code_name
0  427.31    Z324563  ...  (2010, 2012]  427.31#Atrial fibrillation (CMS/HCC)
1   H53.9    Z324563  ...  (2012, 2014]              H53.9#Visual disturbance
2     725    Z324563  ...  (2008, 2010]  725#Polymyalgia rheumatica (CMS/HCC)
3     725    Z324563  ...  (2010, 2012]  725#Polymyalgia rheumatica (CMS/HCC)
4    None    Z273652  ...  (2002, 2004]   None#Disorder of bone and cartilage
5   272.0    Z273652  ...  (2004, 2006]       272.0#Pure hypercholesterolemia
6  729.81    Z273652  ...  (2010, 2012]               729.81#Swelling of limb
7   446.5    Z273652  ...  (2008, 2010]  446.5#Giant cell arteritis (CMS/HCC)
8     725    Z273652  ...  (2010, 2012]  725#Polymyalgia rheumatica (CMS/HCC)
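The row-wise `agg('#'.join, axis=1)` call runs a Python function once per row and raises a `TypeError` if a cell holds a real missing value rather than the string "None". A column-wise alternative, sketched here on a hypothetical two-row frame, is `Series.str.cat`. Filling missing codes with the literal text "None" is an assumption that mirrors the question's expected output:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dx_code": ["427.31", None],
        "dx_name": [
            "Atrial fibrillation (CMS/HCC)",
            "Disorder of bone and cartilage",
        ],
    }
)

# Replace missing codes with the text "None", then join the two columns
# element-wise with "#" in a single vectorized call.
df["DX_code_name"] = df["dx_code"].fillna("None").str.cat(df["dx_name"], sep="#")
print(df["DX_code_name"].tolist())
```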

Finally, we use pivot_table:

table = pd.pivot_table(df_cut,
                       values=['DX_code_name'],
                       index=['patient_id'],
                       columns=['year'],
                       # join multiple entries in the same cell with ";"
                       aggfunc=lambda x: '; '.join(x),
                       fill_value=np.nan
                       )

Let's take a look:

table
DX_code_name
year    2004    2006    2009    2010    2011    2012    2014
patient_id                          
Z273652 None#Disorder of bone and cartilage 272.0#Pure hypercholesterolemia NaN 446.5#Giant cell arteritis (CMS/HCC)    725#Polymyalgia rheumatica (CMS/HCC)    729.81#Swelling of limb NaN
Z324563 NaN NaN 725#Polymyalgia rheumatica (CMS/HCC)    NaN 725#Polymyalgia rheumatica (CMS/HCC)    427.31#Atrial fibrillation (CMS/HCC)    H53.9#Visual disturbance
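Note that the pivot above keys its columns on `year`, so each year gets its own column instead of each two-year window. A minimal sketch of the layout the question asks for, assuming windows start on even years (as 2004 does) and using hypothetical `period` and `entry` columns on a four-row sample:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "patient_id": ["Z273652", "Z273652", "Z273652", "Z324563"],
        "dx_code": ["446.5", "725", "272.0", "725"],
        "dx_name": [
            "Giant cell arteritis (CMS/HCC)",
            "Polymyalgia rheumatica (CMS/HCC)",
            "Pure hypercholesterolemia",
            "Polymyalgia rheumatica (CMS/HCC)",
        ],
        "year": [2010, 2011, 2006, 2011],
    }
)

# Start year of each two-year window: 2010 and 2011 both map to 2010.
start = df["year"] - df["year"] % 2

# Hypothetical column labels in the question's "2010-2011_dx" style.
df["period"] = start.astype(str) + "-" + (start + 1).astype(str) + "_dx"
df["entry"] = df["dx_code"] + "#" + df["dx_name"]

wide = (
    df.groupby(["patient_id", "period"])["entry"]
    .agg("; ".join)     # ";" between entries of the same window
    .unstack("period")  # one column per window, NaN where a patient has none
)
print(wide)
```

Windows in which no patient has an entry do not appear as columns here; reindexing the columns against a full list of labels would add them.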
