DataFrame with multiple values in each column: how do I one-hot encode them under top-level column headers?

Posted 2024-06-12 17:26:02


I have a DataFrame in which every cell holds a comma-separated list of values. I can't figure out how to one-hot encode the data in each column.

In:

import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'], ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'], ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]

# The cells are strings, so no dtype=float; the index matches the names shown in Out
df = pd.DataFrame(lst, columns=['A', 'B', 'C'], index=['Ella', 'Mike', 'Dave'])
Out:

        A                     B                        C
Ella    Red, Blue, Yellow     Blue, Green, Yellow      Green, Red, Blue
Mike    Yellow, Red, Blue     Blue, Red, Green         Yellow, Blue, Red
Dave    Yellow, Red, Green    Red, Yellow, Blue        Green, Blue, Red

I would like to build the result with multi-level column headers, like this:

       A                                 B                               C
       Red    Blue   Green   Yellow      Red    Blue   Green   Yellow    ....
Ella   1      1      0       1           0      1      1       1         ....
Mike   1      1      0       1           1      1      1       0         ....
Dave   1      0      1       1           1      1      0       1         ....

I would really appreciate some guidance, as I have been stuck on this for a while.


2 answers

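Here is one way:

# Flatten to a Series indexed by (row, column), then one-hot encode on ','
df = df.stack().str.get_dummies(sep=',')
# Splitting on ',' alone leaves labels like ' Blue'; strip the spaces
df.columns = df.columns.str.strip()
# Sum the now-duplicated labels, then lift (column, colour) into the header
df = df.stack().groupby(level=[0,1,2]).sum().unstack(level=[1,2])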
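A variant of the same idea, offered only as a sketch and not as the answerer's code: splitting on `', '` (comma plus space) produces labels that are already clean, so the strip step and the duplicate column labels it creates are avoided. It assumes every cell uses exactly `', '` as the separator, as in the question's data; the names `dummies` and `encoded` are made up for illustration.

```python
import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'], index=['Ella', 'Mike', 'Dave'])

# Long Series indexed by (row, column); split on ', ' so that the
# resulting labels carry no leading spaces.
dummies = df.stack().str.get_dummies(sep=', ')

# Move the original column names back into the header, put them on top,
# and sort so each of A, B, C holds its colours side by side.
encoded = dummies.unstack().swaplevel(axis=1).sort_index(axis=1)

# Restore the original row order in case unstack reordered the rows.
encoded = encoded.reindex(df.index)

print(encoded)
```

If the separator is not consistent (for example `'Red,Blue'` mixed with `'Red, Blue'`), the split-on-`','`-then-strip approach above is the safer choice.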

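There is a very good answer on this. In your case, you have to apply the same thing to each of the columns, so something like this (it can be optimized further):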

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'], ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'], ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
    
df = pd.DataFrame(lst, columns=['A', 'B', 'C'])  # cells are strings, so no dtype=float

mlb = {}
res = {}
for column in df.columns:
    mlb[column] = MultiLabelBinarizer()

    res[column] = pd.DataFrame(mlb[column].fit_transform(df[column].apply(lambda x: [j.strip() for j in x.split(",")])),
                       columns=mlb[column].classes_,
                       index=df[column].index)

arrays = [np.concatenate(([np.array([column]*len(mlb[column].classes_)) for column in df.columns])),
          np.concatenate(([mlb[column].classes_ for column in df.columns]))]
df_end = pd.DataFrame(columns=arrays, index=df.index)  # reuse the input index instead of hard-coding [0, 1, 2]

for column in df.columns:
    df_end[column] = res[column]

df_end


    A                       B                     C
    Blue Green Red  Yellow  Blue Green Red Yellow Blue Green Red Yellow
0   1    0     1    1       1    1     0   1      1    1     1   0
1   1    0     1    1       1    1     1   0      1    0     1   1
2   0    1     1    1       1    0     1   1      1    1     1   0
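Since the answer notes that it can be optimized further, here is one possible simplification, given as a sketch and not as the answerer's method. It swaps `MultiLabelBinarizer` for pandas' own `Series.str.get_dummies` and lets `pd.concat` build the two-level header from a dict's keys. It assumes the separator is always `', '`; the name `parts` is made up for illustration.

```python
import pandas as pd

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'],
       ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'],
       ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
df = pd.DataFrame(lst, columns=['A', 'B', 'C'])

# One indicator frame per original column, keyed by the column name.
parts = {col: df[col].str.get_dummies(sep=', ') for col in df.columns}

# Concatenating a dict along axis=1 uses its keys as the top header level.
df_end = pd.concat(parts, axis=1)

print(df_end)
```

On the question's data this gives the same table as shown above, without needing scikit-learn or NumPy.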
