How do I combine LabelBinarizer and OneHotEncoder for categorical variables in Python?

Posted 2024-05-29 05:17:03


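Over the past few days I have searched Stack Overflow for a suitable tutorial or Q&A, but found no proper guide, mainly because the examples that show how to use LabelBinarizer or OneHotEncoder do not show how they are integrated into a pipeline, and vice versa.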

I have a dataset with 4 variables:

num1    num2    cate1    cate2
3       4       Cat      1
9       23      Dog      0
10      5       Dog      1

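num1 and num2 are numeric variables; cate1 and cate2 are categorical variables. I know I need to encode the categorical variables before fitting an ML algorithm, but after several attempts I am still not sure how to do that inside a pipeline.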

[code block not preserved in the source]

This gives me the error ValueError: could not convert string to float: 'Cat'

Replacing the fourth-to-last line with

('categorical', make_pipeline(Columns(names=X_cat_cols),OneHotEncoder()))

gives me the same ValueError: could not convert string to float: 'Cat'.

Replacing the fourth-to-last line with

('categorical', make_pipeline(Columns(names=X_cat_cols),LabelBinarizer(),OneHotEncoder()))
])),

gives me a different error: TypeError: fit_transform() takes 2 positional arguments but 3 were given.

Replacing the fourth-to-last line with

('numeric', make_pipeline(Columns(names=X_num_cols),LabelBinarizer())),

gives me this error: TypeError: fit_transform() takes 2 positional arguments but 3 were given.
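The TypeError arises because LabelBinarizer is designed for target labels: its fit_transform accepts only y, whereas a pipeline calls fit_transform(X, y) on each step. A minimal sketch of a common workaround follows, assuming a thin wrapper class (the name PipelineLabelBinarizer is made up for illustration) and a single categorical column with made-up values:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving LabelBinarizer the (X, y) signature pipelines expect."""

    def fit(self, X, y=None):
        # Flatten the single column to 1-D, which is what LabelBinarizer accepts
        self.encoder_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.encoder_.transform(np.asarray(X).ravel())


# One categorical column with three distinct values (illustrative data)
cate1 = np.array([["Cat"], ["Dog"], ["Dog"], ["Bird"]])
encoded = make_pipeline(PipelineLabelBinarizer()).fit_transform(cate1)
print(encoded)
```

With three or more categories this yields one indicator column per category; with exactly two categories LabelBinarizer returns a single 0/1 column instead.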


2 Answers

Following Marcus's suggestion, I tried to install the scikit-learn dev version, but then found something similar called category_encoders.

Changing the code to the following works:

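import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
import category_encoders as CateEncoder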

# Class that identifies Column type
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# from the selected X, further choose categorical only
X_selected_cat = X_selected.filter(['cate1', 'cate2']) # hand selected since some cat var has value 0, 1

# Find the numerical columns, exclude categorical columns
X_num_cols = X_selected.columns[X_selected.dtypes.apply(lambda c: np.issubdtype(c, np.number))] # list of numeric column names, automated here
X_cat_cols = X_selected_cat.columns # list of categorical column names, previously hand-selected

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols),CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
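For comparison, scikit-learn 0.20 and later can encode string columns directly with OneHotEncoder and route columns with ColumnTransformer, so neither the custom Columns class nor an extra package is strictly needed. A minimal sketch under that assumption; the toy rows and the target y are invented for illustration and are not the asker's data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data shaped like the question's table; the target y is made up
X = pd.DataFrame({
    "num1": [3, 9, 10, 4, 8, 11],
    "num2": [4, 23, 5, 6, 20, 7],
    "cate1": ["Cat", "Dog", "Dog", "Cat", "Dog", "Cat"],
    "cate2": [1, 0, 1, 1, 0, 0],
})
y = [0, 1, 1, 0, 1, 0]

preprocess = ColumnTransformer([
    # Scale the numeric columns
    ("numeric", StandardScaler(), ["num1", "num2"]),
    # One-hot encode the categorical columns; unseen categories become all zeros
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["cate1", "cate2"]),
])

pipe = Pipeline([
    ("features", preprocess),
    ("LR_model", LogisticRegression()),
])
pipe.fit(X, y)

features = pipe.named_steps["features"].transform(X)
predictions = pipe.predict(X)
```

Here the transformed matrix has six columns: two scaled numeric columns plus two indicator columns for each of cate1 and cate2.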

As for me, I prefer to use LabelEncoder. Here is just a toy example.

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn import linear_model

df= pd.DataFrame({ 'y': [10,2,3,4,5,6,7,8], 'a': ['a', 'b','a', 'b','a', 'b','a', 'b' ],
                  'b': ['a', 'b','a', 'b','a', 'b','b', 'b' ],  'c': ['a', 'b','a', 'a','a', 'b','b', 'b' ]})
df

I define a class to select columns

[code block not preserved in the source]
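A minimal sketch of what such a selector could look like, since the original class is not shown here. It assumes the class is called MultiColumn and takes a columns argument, as in the pipeline later in this answer; the body is a guess, not the answerer's original code:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MultiColumn(BaseEstimator, TransformerMixin):
    """Assumed selector: keeps only the named DataFrame columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; column selection is stateless
        return self

    def transform(self, X):
        return X[self.columns]


# Small illustrative frame, separate from the answer's df
demo = pd.DataFrame({"a": ["a", "b"], "b": ["a", "b"], "c": ["b", "a"]})
selected = MultiColumn(columns=["a", "c"]).fit_transform(demo)
```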

Now I define a class that preprocesses with LabelEncoder

lb = df[['a', 'c']]
class MyLEncoder():

    def transform(self, X, **fit_params):
        enc = preprocessing.LabelEncoder()
        enc_data = []
        for i in list(lb.columns):
            encc = enc.fit(lb[i])
            enc_data.append(encc.transform(X[i]))

        return np.asarray(enc_data).T

    def fit_transform(self, X,y=None,  **fit_params):
        self.fit(X,y,  **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

I use a for loop because LabelEncoder can only be applied to a single vector at a time. The pipeline:

X = df[['a', 'b', 'c']]
y = df['y']
regressor = linear_model.SGDRegressor()

pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
             # categorical
            ('categorical', Pipeline([
                 ('selector', MultiColumn(columns=['a', 'c'])),
                ('one_hot', MyLEncoder())
            ])),
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])
pipeline.fit(X, y)
pipeline.predict(X)

And check it on new data

new= pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ],'c': ['b', 'a' ], 'b': [3, 6],})
pipeline.predict(new)

Similarly, we can do the same for any method of preprocessing categorical data.
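As one instance of that pattern, here is a sketch of a variant that learns one LabelEncoder per column inside fit, rather than refitting on the global lb frame at transform time. The class name PerColumnLabelEncoder and the small frames are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class PerColumnLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: one LabelEncoder per column, learned in fit."""

    def fit(self, X, y=None, **fit_params):
        # Remember a fitted encoder for every column seen during training
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in X.columns}
        return self

    def transform(self, X):
        # Encode column by column, since LabelEncoder works on 1-D input
        encoded = [self.encoders_[col].transform(X[col]) for col in self.encoders_]
        return np.asarray(encoded).T


train = pd.DataFrame({"a": ["a", "b", "a"], "c": ["a", "b", "b"]})
new = pd.DataFrame({"a": ["a", "b"], "c": ["b", "a"]})
codes = PerColumnLabelEncoder().fit(train).transform(new)
print(codes)
```

Because the encoders are stored on the instance, the transformer can be cloned and reused on other data without depending on a module-level variable. Like LabelEncoder itself, it raises an error on categories that were not seen during fit.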
