向原始数据框添加虚拟列

52 投票
2 回答
52167 浏览
提问于 2025-04-18 03:40

我有一个数据表,看起来像这样:

             EXEC_FULLNAME   YEAR BECAMECEO
CO_PER_ROL
5622        Ira A. Eichner   1992  19550101
5622        Ira A. Eichner   1993  19550101
5622        Ira A. Eichner   1994  19550101
5623       David P. Storch   1994  19961009
5623       David P. Storch   1995  19961009
5623       David P. Storch   1996  19961009

在“YEAR”这一列中,我想把年份列(比如1993、1994……到2009)添加到原来的数据表里。举个例子,如果某一行的“YEAR”值是1992,那么在1992这一列的值应该是1,其他年份的值则是0。

我尝试使用循环来实现这个,但因为我的数据量很大,循环似乎一直在运行,没法结束。

2 个回答

0

另一种方法是使用 str.get_dummies()。这个方法适用于字符串类型的值,所以你需要先把数据转换成字符串。

dummies = df['YEAR'].astype(str).str.get_dummies()
df = pd.concat([df.drop(columns='YEAR'), dummies], axis=1)

res


还有一种方法是使用 OneHotEncoder,这个工具来自 sklearn.preprocessing

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
df[ohe.get_feature_names_out()] = ohe.fit_transform(df[['YEAR']]).toarray()
85
In [77]: df = pd.concat([df, pd.get_dummies(df['YEAR'])], axis=1); df
Out[77]: 
      JOINED_CO GENDER    EXEC_FULLNAME  GVKEY  YEAR    CONAME  BECAMECEO  \
5622        NaN   MALE   Ira A. Eichner   1004  1992  AAR CORP   19550101   
5622        NaN   MALE   Ira A. Eichner   1004  1993  AAR CORP   19550101   
5622        NaN   MALE   Ira A. Eichner   1004  1994  AAR CORP   19550101   
5622        NaN   MALE   Ira A. Eichner   1004  1995  AAR CORP   19550101   
5622        NaN   MALE   Ira A. Eichner   1004  1996  AAR CORP   19550101   
5622        NaN   MALE   Ira A. Eichner   1004  1997  AAR CORP   19550101   
5622        NaN   MALE   Ira A. Eichner   1004  1998  AAR CORP   19550101   
5623        NaN   MALE  David P. Storch   1004  1992  AAR CORP   19961009   
5623        NaN   MALE  David P. Storch   1004  1993  AAR CORP   19961009   
5623        NaN   MALE  David P. Storch   1004  1994  AAR CORP   19961009   
5623        NaN   MALE  David P. Storch   1004  1995  AAR CORP   19961009   
5623        NaN   MALE  David P. Storch   1004  1996  AAR CORP   19961009   

      REJOIN   LEFTOFC    LEFTCO  RELEFT    REASON  PAGE  1992  1993  1994  \
5622     NaN  19961001  19990531     NaN  RESIGNED    79     1     0     0   
5622     NaN  19961001  19990531     NaN  RESIGNED    79     0     1     0   
5622     NaN  19961001  19990531     NaN  RESIGNED    79     0     0     1   
5622     NaN  19961001  19990531     NaN  RESIGNED    79     0     0     0   
5622     NaN  19961001  19990531     NaN  RESIGNED    79     0     0     0   
5622     NaN  19961001  19990531     NaN  RESIGNED    79     0     0     0   
5622     NaN  19961001  19990531     NaN  RESIGNED    79     0     0     0   
5623     NaN       NaN       NaN     NaN       NaN    57     1     0     0   
5623     NaN       NaN       NaN     NaN       NaN    57     0     1     0   
5623     NaN       NaN       NaN     NaN       NaN    57     0     0     1   
5623     NaN       NaN       NaN     NaN       NaN    57     0     0     0   
5623     NaN       NaN       NaN     NaN       NaN    57     0     0     0   

      1995  1996  1997  1998  
5622     0     0     0     0  
5622     0     0     0     0  
5622     0     0     0     0  
5622     1     0     0     0  
5622     0     1     0     0  
5622     0     0     1     0  
5622     0     0     0     1  
5623     0     0     0     0  
5623     0     0     0     0  
5623     0     0     0     0  
5623     1     0     0     0  
5623     0     1     0     0  
df = pd.concat([df.drop('YEAR', axis=1), pd.get_dummies(df['YEAR'])], axis=1)

如果你想删除YEAR这一列,你可以接着使用del df['YEAR']这行代码。或者,在调用concat之前,先从df中去掉YEAR这一列:

撰写回答