How do I use sklearn's ColumnTransformer?

Posted 2024-05-14 07:54:42


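I am trying to use LabelEncoder to convert categorical values (the Country column in my case) into integer codes, and then OneHotEncoder to one-hot encode them. However, I get a warning saying that OneHotEncoder's 'categorical_features' keyword is deprecated and that I should "use the ColumnTransformer instead". How can I use ColumnTransformer to get the same result?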

Below are my input data set and the code I tried.

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name

label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()

This is the output I get. How can I get the same output with ColumnTransformer?

0(fran) 1(ger) 2(spain) 3(age)  4(salary)
1         0       0      44        72000
0         0       1      27        48000
0         1       0      30        54000
0         0       1      38        61000
0         1       0      40        67000
1         0       0      35        58000
0         0       1      36        52000
1         0       0      48        79000
0         1       0      50        83000
1         0       0      37        67000

I tried the following code:

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(

    ( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()

The code above encodes the Country column, but the Age and Salary columns are missing from the x variable after the transformation.


3 answers

@Fawwaz Yusran, to deal with this warning...

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)

remove the following...

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Because you are using OneHotEncoder directly, you do not need LabelEncoder.
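As a minimal sketch of that point, the snippet below feeds the string-valued Country column from the question straight into OneHotEncoder, with no LabelEncoder step. The variable names `countries`, `encoder` and `encoded` are illustrative, and it assumes a scikit-learn release in which OneHotEncoder accepts string categories (0.20 or later).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Country column from the question's data set, kept as raw strings.
countries = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
})

# OneHotEncoder learns the categories from the strings themselves,
# so no integer encoding step is needed beforehand.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded[:3])             # one-hot rows for the first three samples
```

The learned categories are sorted alphabetically, which is why the question's output columns appear in the order France, Germany, Spain.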

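It is strange that you would want to one-hot encode continuous data such as Salary. That makes no sense unless you have binned the salary into ranges or categories. If I were in your place, I would do this: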

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder



numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

From here you can chain it with a classifier, for example:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

Use it like this:

clf.fit(X_train,y_train)

This applies the preprocessor and then passes the transformed data to the predictor.
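To make the preprocess-then-predict flow concrete, here is a self-contained sketch on the question's data set. It is a simplified variant rather than the exact pipeline above: Age and Salary are both treated as numeric, the imputers are omitted because the sample has no missing values, and the target `y` is a hypothetical label invented only so the classifier has something to fit.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Data set from the question.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

# HYPOTHETICAL binary target, not part of the original question.
y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# Scale the numeric columns and one-hot encode the categorical one.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Age", "Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# fit() runs the preprocessor first, then trains the classifier on its output.
clf.fit(df, y)
predictions = clf.predict(df)

# The fitted preprocessor can also be inspected on its own.
transformed = clf.named_steps["preprocessor"].transform(df)
print(transformed.shape)
```

Keeping the preprocessing inside the Pipeline means the same fitted scaler and encoder are reused automatically when `predict` is called on new rows.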

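I don't think the poster intends to transform Age and Salary. According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) keeps only the columns specified in the transformers (i.e., [0] in your example). You should set remainder="passthrough" to keep the remaining columns. In other words: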

preprocessor = make_column_transformer( (OneHotEncoder(),[0]),remainder="passthrough")
x = preprocessor.fit_transform(x)
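For completeness, a runnable sketch of the same idea on the question's data set is shown below. It assumes a recent scikit-learn release in which each tuple is written as (transformer, columns); `sparse_threshold=0` is an extra choice made here so that the result is always a dense array and no `.toarray()` call is needed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Data set from the question.
x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

# One-hot encode column 0 (Country) and pass Age and Salary through unchanged.
preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = np.asarray(preprocessor.fit_transform(x))

print(encoded[:3])
```

The transformed columns come first and the passed-through columns follow, so the column order is France, Germany, Spain, Age, Salary, matching the layout the question asks for.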
