How do I combine LabelBinarizer and OneHotEncoder in Python for categorical variables?

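2 Answers

User

#1 · Edited 2024-05-29 05:17:03

Following Marcus's suggestion, I tried to install the scikit-learn dev version, but then came across something similar called category_encoders.

Changing the code to the following worked: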

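import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer
import category_encoders as CateEncoder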

# Transformer that selects the given columns from a DataFrame
class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

X_selected = X.filter(['num1', 'num2', 'cate1', 'cate2'])

# From the selected X, pick the categorical columns only
X_selected_cat = X_selected.filter(['cate1', 'cate2']) # hand-selected, since some categorical variables hold the values 0 and 1

# Categorical column names, hand-selected above
X_cat_cols = X_selected_cat.columns
# Numeric column names, detected automatically; drop any hand-selected categorical column that is stored as a number
X_num_cols = X_selected.select_dtypes(include=np.number).columns.difference(X_cat_cols)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, 
                                                    test_size=0.5, 
                                                    random_state=567, 
                                                    stratify=y)

# Pipeline
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=X_num_cols),StandardScaler())),
        ('categorical', make_pipeline(Columns(names=X_cat_cols),CateEncoder.BinaryEncoder()))
    ])),
    ('LR_model', LogisticRegression()),
])
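
For comparison, here is a minimal sketch of the same numeric-plus-categorical split using only scikit-learn. It swaps in `ColumnTransformer` and `OneHotEncoder` in place of the custom `Columns` selector and `BinaryEncoder` used above, so it is an alternative, not the code from this answer. The toy DataFrame is hypothetical and only reuses the column names from the answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; only the column names come from the answer above
df = pd.DataFrame({
    "num1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "num2": [10.0, 9.0, 8.0, 7.0, 6.0, 5.0],
    "cate1": ["a", "b", "a", "b", "a", "b"],
    "cate2": ["x", "y", "z", "x", "y", "z"],
    "MED": [0, 1, 0, 1, 0, 1],
})
X = df.drop("MED", axis=1)
y = df["MED"]

# Scale the numeric columns and one-hot encode the categorical ones
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["num1", "num2"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["cate1", "cate2"]),
])

pipe = Pipeline([
    ("features", preprocess),
    ("LR_model", LogisticRegression()),
])
pipe.fit(X, y)

# 2 scaled numeric columns + 2 one-hot columns for cate1 + 3 for cate2
features = pipe.named_steps["features"].transform(X)
predictions = pipe.predict(X)
```

With `handle_unknown="ignore"`, a category that appears only at prediction time is encoded as all zeros instead of raising an error.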

User

#2 · Edited 2024-05-29 05:17:03

As for me, I prefer to use LabelEncoder. Here is just a toy example.

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn import linear_model

df= pd.DataFrame({ 'y': [10,2,3,4,5,6,7,8], 'a': ['a', 'b','a', 'b','a', 'b','a', 'b' ],
                  'b': ['a', 'b','a', 'b','a', 'b','b', 'b' ],  'c': ['a', 'b','a', 'a','a', 'b','b', 'b' ]})
df

I define a class to select columns:

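[code block missing in the source: the definition of the column-selector class, used later as MultiColumn(columns=[...])]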
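
A minimal sketch of such a selector, assuming it only has to return the named columns of a DataFrame. The body is a guess based on how `MultiColumn(columns=[...])` is called in the pipeline later in this answer; it is not the author's original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MultiColumn(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name (hypothetical reconstruction)."""

    def __init__(self, columns=None):
        self.columns = columns  # list of column names to keep

    def fit(self, X, y=None):
        # Nothing to learn: column selection is stateless
        return self

    def transform(self, X):
        return X[self.columns]

# Quick check on a small hypothetical frame
demo = pd.DataFrame({"a": ["a", "b"], "b": ["a", "b"], "c": ["b", "a"]})
selected = MultiColumn(columns=["a", "c"]).fit_transform(demo)
```

Inheriting from `TransformerMixin` supplies `fit_transform`, so the class can be used as a pipeline step without defining it by hand.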

Now I define a class that does the preprocessing with LabelEncoder:

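class MyLEncoder():
    # Label-encodes every column of X, using one LabelEncoder per column

    def fit(self, X, y=None, **fit_params):
        # LabelEncoder handles one 1-D vector at a time, so fit one per column
        self.encoders_ = {}
        for col in X.columns:
            self.encoders_[col] = preprocessing.LabelEncoder().fit(X[col])
        return self

    def transform(self, X, **fit_params):
        enc_data = []
        for col, enc in self.encoders_.items():
            enc_data.append(enc.transform(X[col]))
        # One row per sample, one column per encoded feature
        return np.asarray(enc_data).T

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)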
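
The column-by-column loop in the class above is needed because of how `LabelEncoder` behaves. A small illustration, using made-up values: it accepts a single 1-D vector, rejects a 2-D array with several columns, and raises on labels it did not see during `fit`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Classes are sorted, so 'a' maps to 0 and 'b' maps to 1
codes = enc.fit_transform(["a", "b", "a", "b"])

# A 2-D array with more than one column is rejected
try:
    LabelEncoder().fit(np.array([["a", "b"], ["b", "a"]]))
    accepts_2d = True
except ValueError:
    accepts_2d = False

# A label that was not present during fit is rejected as well
try:
    enc.transform(["z"])
    accepts_unseen = True
except ValueError:
    accepts_unseen = False
```

The second point matters when predicting on new data: any category absent from the training set will make the pipeline fail at the encoding step.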

I use a for-loop because LabelEncoder can only be applied to a single vector at a time. The pipeline:

X = df[['a', 'b', 'c']]
y = df['y']
regressor = linear_model.SGDRegressor()

pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # categorical
            ('categorical', Pipeline([
                ('selector', MultiColumn(columns=['a', 'c'])),
                ('label_encoder', MyLEncoder())
            ])),
        ])),
    # Use a regression
    ('model_fitting', regressor),
])
pipeline.fit(X, y)
pipeline.predict(X)

And check it on new data:

new= pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ],'c': ['b', 'a' ], 'b': [3, 6],})
pipeline.predict(new)

Similarly, we can do the same for any other method of preprocessing categorical data.
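
For instance, the same selector-then-encoder pattern with `OneHotEncoder` might look like the sketch below. `ColumnSelector` is a hypothetical helper written for this example, and the data mirrors columns `a` and `c` of the toy frame above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

df = pd.DataFrame({
    "a": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "c": ["a", "b", "a", "a", "a", "b", "b", "b"],
})

one_hot_branch = Pipeline([
    ("selector", ColumnSelector(columns=["a", "c"])),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Output columns are ordered as: a=a, a=b, c=a, c=b
encoded = one_hot_branch.fit_transform(df).toarray()

# New data with a category ('z') that was absent during fit
new = pd.DataFrame({"a": ["a", "z"], "c": ["b", "a"]})
encoded_new = one_hot_branch.transform(new).toarray()
```

This branch could replace the `'categorical'` entry in the `FeatureUnion` above; unlike label encoding, it does not impose an ordering on the categories.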
