向原始数据框添加虚拟列
我有一个数据表,看起来像这样:
EXEC_FULLNAME YEAR BECAMECEO
CO_PER_ROL
5622 Ira A. Eichner 1992 19550101
5622 Ira A. Eichner 1993 19550101
5622 Ira A. Eichner 1994 19550101
5623 David P. Storch 1994 19961009
5623 David P. Storch 1995 19961009
5623 David P. Storch 1996 19961009
在“YEAR”这一列中,我想把年份列(比如1993、1994……到2009)添加到原来的数据表里。举个例子,如果某一行的“YEAR”值是1992,那么在1992这一列的值应该是1,其他年份的值则是0。
我尝试使用循环来实现这个,但因为我的数据量很大,循环似乎一直在运行,没法结束。
2 个回答
0
另一种方法是使用 str.get_dummies()
。这个方法适用于字符串类型的值,所以你需要先把数据转换成字符串。
dummies = df['YEAR'].astype(str).str.get_dummies()
df = pd.concat([df.drop(columns='YEAR'), dummies], axis=1)
还有一种方法是使用 OneHotEncoder
,这个工具来自 sklearn.preprocessing
。
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
df[ohe.get_feature_names_out()] = ohe.fit_transform(df[['YEAR']]).toarray()
85
In [77]: df = pd.concat([df, pd.get_dummies(df['YEAR'])], axis=1); df
Out[77]:
JOINED_CO GENDER EXEC_FULLNAME GVKEY YEAR CONAME BECAMECEO \
5622 NaN MALE Ira A. Eichner 1004 1992 AAR CORP 19550101
5622 NaN MALE Ira A. Eichner 1004 1993 AAR CORP 19550101
5622 NaN MALE Ira A. Eichner 1004 1994 AAR CORP 19550101
5622 NaN MALE Ira A. Eichner 1004 1995 AAR CORP 19550101
5622 NaN MALE Ira A. Eichner 1004 1996 AAR CORP 19550101
5622 NaN MALE Ira A. Eichner 1004 1997 AAR CORP 19550101
5622 NaN MALE Ira A. Eichner 1004 1998 AAR CORP 19550101
5623 NaN MALE David P. Storch 1004 1992 AAR CORP 19961009
5623 NaN MALE David P. Storch 1004 1993 AAR CORP 19961009
5623 NaN MALE David P. Storch 1004 1994 AAR CORP 19961009
5623 NaN MALE David P. Storch 1004 1995 AAR CORP 19961009
5623 NaN MALE David P. Storch 1004 1996 AAR CORP 19961009
REJOIN LEFTOFC LEFTCO RELEFT REASON PAGE 1992 1993 1994 \
5622 NaN 19961001 19990531 NaN RESIGNED 79 1 0 0
5622 NaN 19961001 19990531 NaN RESIGNED 79 0 1 0
5622 NaN 19961001 19990531 NaN RESIGNED 79 0 0 1
5622 NaN 19961001 19990531 NaN RESIGNED 79 0 0 0
5622 NaN 19961001 19990531 NaN RESIGNED 79 0 0 0
5622 NaN 19961001 19990531 NaN RESIGNED 79 0 0 0
5622 NaN 19961001 19990531 NaN RESIGNED 79 0 0 0
5623 NaN NaN NaN NaN NaN 57 1 0 0
5623 NaN NaN NaN NaN NaN 57 0 1 0
5623 NaN NaN NaN NaN NaN 57 0 0 1
5623 NaN NaN NaN NaN NaN 57 0 0 0
5623 NaN NaN NaN NaN NaN 57 0 0 0
1995 1996 1997 1998
5622 0 0 0 0
5622 0 0 0 0
5622 0 0 0 0
5622 1 0 0 0
5622 0 1 0 0
5622 0 0 1 0
5622 0 0 0 1
5623 0 0 0 0
5623 0 0 0 0
5623 0 0 0 0
5623 1 0 0 0
5623 0 1 0 0
df = pd.concat([df.drop('YEAR', axis=1), pd.get_dummies(df['YEAR'])], axis=1)
如果你想删除YEAR
这一列,你可以接着使用del df['YEAR']
这行代码。或者,在调用concat
之前,先从df
中去掉YEAR
这一列: