Do I have to loop? Is there a faster way to build dummy variables?

   Unnamed: 0    plant         att_1     att_2  ...
0           0  plant_a      sunlover      tall
1           1  plant_b    waterlover  sunlover
2           2  plant_c  fast growing  sunlover

   Unnamed: 0    plant         att_1     att_2  sunlover  waterlover  tall  ...
0           0  plant_a      sunlover      tall         1           0     1
1           1  plant_b    waterlover  sunlover         1           1     0
2           2  plant_c  fast growing  sunlover         1           0     0

2 answers

User

Answer 1 · edited 2024-05-15 08:43:02

What you need is in some respects similar to get_dummies, but you should go about it a different way.

Define a view of df limited to the "attribute" columns:

attCols = df[['att_1', 'att_2']]

In the final version, add the other "attribute" columns to this list.

Then define an Index holding the unique attribute names:

colVals = pd.Index(np.sort(attCols.stack().unique()))

The third step is to define a function that computes the dummy values for the current row:

def myDummies(row):
    return pd.Series(colVals.isin(row).astype(int), index=colVals)

The last step is to join df with the result of applying this function to each row of attCols:

df = df.join(attCols.apply(myDummies, axis=1))

For the sample data, the result is:

     plant         att_1     att_2  fast growing  sunlover  tall  waterlover
0  plant_a      sunlover      tall             0         1     1           0
1  plant_b    waterlover  sunlover             0         1     0           1
2  plant_c  fast growing  sunlover             1         1     0           0
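
For reference, the four steps above can be assembled into one self-contained script. This is only a minimal sketch under two assumptions: the sample frame is rebuilt by hand from the question's data (without the Unnamed: 0 column), and Python's built-in sorted() stands in for np.sort so that numpy is not required.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: only these two attribute columns)
df = pd.DataFrame({
    "plant": ["plant_a", "plant_b", "plant_c"],
    "att_1": ["sunlover", "waterlover", "fast growing"],
    "att_2": ["tall", "sunlover", "sunlover"],
})

# Step 1: restrict df to the attribute columns
attCols = df[["att_1", "att_2"]]

# Step 2: sorted Index of the unique attribute names
colVals = pd.Index(sorted(attCols.stack().unique()))

# Step 3: for one row, flag which attribute names occur in it
def myDummies(row):
    return pd.Series(colVals.isin(row).astype(int), index=colVals)

# Step 4: apply row by row and join the flags back onto df
result = df.join(attCols.apply(myDummies, axis=1))
print(result)
```

Because apply with axis=1 calls myDummies once per row in Python, this version is easy to follow but slow on large frames.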

User

Answer 2 · edited 2024-05-15 08:43:02

Use get_dummies with max:

c = ['att_1', 'att_2']
df1 = df.join(pd.get_dummies(df[c], prefix='', prefix_sep='').max(axis=1, level=0))
print (df1)
     plant         att_1     att_2  fast growing  sunlover  waterlover  tall
0  plant_a      sunlover      tall             0         1           0     1
1  plant_b    waterlover  sunlover             0         1           1     0
2  plant_c  fast growing  sunlover             1         1           0     0
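
Note that the level argument of DataFrame.max was removed in pandas 2.0, so the call above only runs on older versions. The following is a hedged sketch of one possible equivalent for newer pandas, not the answer author's own code: it merges the duplicate column names by transposing and grouping on the labels. The sample frame is rebuilt from the question, and dtype=int is passed because get_dummies returns booleans by default from pandas 2.0 on.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "plant": ["plant_a", "plant_b", "plant_c"],
    "att_1": ["sunlover", "waterlover", "fast growing"],
    "att_2": ["tall", "sunlover", "sunlover"],
})

c = ["att_1", "att_2"]

# One 0/1 column per (source column, value); empty prefix keeps bare value names,
# so "sunlover" appears twice (once from att_1, once from att_2)
dummies = pd.get_dummies(df[c], prefix="", prefix_sep="", dtype=int)

# Merge same-named columns: transpose, group rows by label, take the max, transpose back
merged = dummies.T.groupby(level=0).max().T

df1 = df.join(merged)
print(df1)
```

With this variant the dummy columns come out sorted by name, so their order differs from the printed output above while the values are the same.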

On larger, more realistic data the performance should differ; here the sample is repeated to 3k rows:

df = pd.concat([df] * 1000, ignore_index=True)


In [339]: %%timeit
     ...: 
     ...: c = ['att_1', 'att_2']
     ...: df1 = df.join(pd.get_dummies(df[c], prefix='', prefix_sep='').max(axis=1, level=0))
     ...: 
     ...: 
10.7 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %%timeit
     ...: attCols = df[['att_1', 'att_2']]
     ...: colVals = pd.Index(np.sort(attCols.stack().unique()))
     ...: def myDummies(row):
     ...:     return pd.Series(colVals.isin(row).astype(int), index=colVals)
     ...: 
     ...: df1 = df.join(attCols.apply(myDummies, axis=1))
     ...: 
     ...: 
1.03 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Another solution:

In [133]: %%timeit
     ...: c = ['att_1', 'att_2']
     ...: df1 = (df.join(pd.DataFrame([dict.fromkeys(x, 1) for x in df[c].to_numpy()])
     ...:                  .fillna(0)
     ...:                  .astype(np.int8)))
     ...:                  
13.1 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
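
To see why the list comprehension works: dict.fromkeys(x, 1) maps every attribute value in a row to 1 (a value repeated within a row collapses to one key), and the DataFrame constructor aligns those keys into columns, leaving NaN where a row lacks an attribute. A small sketch on the question's sample data, rebuilt here by hand as an assumption:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "plant": ["plant_a", "plant_b", "plant_c"],
    "att_1": ["sunlover", "waterlover", "fast growing"],
    "att_2": ["tall", "sunlover", "sunlover"],
})

c = ["att_1", "att_2"]

# One dict per row, e.g. {"sunlover": 1, "tall": 1} for the first row
row_dicts = [dict.fromkeys(x, 1) for x in df[c].to_numpy()]

# Keys become columns; missing attributes are NaN until filled with 0
flags = pd.DataFrame(row_dicts).fillna(0).astype(np.int8)

df1 = df.join(flags)
print(df1)
```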
