Preprocessor based on sklearn ColumnTransformer outputs different columns for the training and test datasets

def preprocess_data(X):
    cat_var = X.select_dtypes(['bool', 'object']).columns
    num_var = X.select_dtypes(['int64', 'float64']).columns
    steps = [
        ('c', Pipeline(steps=[
            ('s', SimpleImputer(strategy='most_frequent')),
            ('oe', OneHotEncoder(handle_unknown='ignore')),
        ]), cat_var),
        ('n', SimpleImputer(strategy='median'), num_var),
    ]
    transformer = ColumnTransformer(transformers=steps, remainder='passthrough')
    X = transformer.fit_transform(X=X)
    return X

2 answers

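User

#1 · Posted on 2024-04-27 05:06:30

You cannot use this function for both the training and the test set, because you would then call fit_transform twice, once on each set. The transformer must be fitted on the training data only; the test data should only be transformed. I suggest using an sklearn Pipeline for this, which handles the process automatically, for example: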

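# preprocessor: the ColumnTransformer built above
# MLalgorithm: placeholder for whichever estimator you use
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('clf', MLalgorithm())
])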
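
To make the fit-on-train, transform-on-test pattern concrete, here is a minimal sketch. The toy data, the column names `zone` and `area`, and the use of `LinearRegression` as a stand-in for the `MLalgorithm` placeholder are assumptions for illustration only and do not come from the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set never contains category "C".
X_train = pd.DataFrame({"zone": ["A", "B", "C", "A"],
                        "area": [10.0, 20.0, None, 40.0]})
y_train = [1.0, 2.0, 3.0, 4.0]
X_test = pd.DataFrame({"zone": ["A", "B"], "area": [15.0, None]})

# Column lists are given explicitly to keep the sketch short.
cat_var = ["zone"]
num_var = ["area"]

preprocessor = ColumnTransformer(
    transformers=[
        ("c", Pipeline(steps=[
            ("s", SimpleImputer(strategy="most_frequent")),
            ("oe", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_var),
        ("n", SimpleImputer(strategy="median"), num_var),
    ],
    remainder="passthrough",
)

# Fit on the training data only, then reuse the fitted transformer.
Xt_train = preprocessor.fit_transform(X_train)
Xt_test = preprocessor.transform(X_test)  # transform only, no refit

# Inside a Pipeline the same thing happens automatically:
# fit() fits the preprocessor, predict() only transforms.
pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("clf", LinearRegression()),  # stand-in estimator (assumption)
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
```

Because the encoder learned its categories from `X_train`, `Xt_train` and `Xt_test` have the same number of columns even though `X_test` lacks one category.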

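User

#2 · Posted on 2024-04-27 05:06:30

If a categorical column in the test set does not contain every category that appears in the training set, one-hot encoding it separately produces fewer columns. The snippet below shows the number of category levels of each one-hot encoded variable in both sets: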

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Categorical columns, selected from the training set
COL = train.select_dtypes(['bool', 'object']).columns

# Note: unique() counts NaN as a level of its own
pd.DataFrame({'var': COL,
              'n_train': [len(train[i].unique()) for i in COL],
              'n_test': [len(test[i].unique()) for i in COL]})


    var            n_train  n_test
0   MSZoning             5       6
1   Street               2       2
2   Alley                3       3
3   LotShape             4       4
4   LandContour          4       4
5   Utilities            2       2
6   LotConfig            5       5
7   LandSlope            3       3
8   Neighborhood        25      25
9   Condition1           9       9
10  Condition2           8       5
11  BldgType             5       5
12  HouseStyle           8       7
13  RoofStyle            6       6
14  RoofMatl             8       4
15  Exterior1st         15      14
16  Exterior2nd         16      16
17  MasVnrType           5       5
18  ExterQual            4       4
19  ExterCond            5       5
20  Foundation           6       6
21  BsmtQual             5       5
22  BsmtCond             5       5
23  BsmtExposure         5       5
24  BsmtFinType1         7       7
25  BsmtFinType2         7       7
26  Heating              6       4
27  HeatingQC            5       5
28  CentralAir           2       2
29  Electrical           6       4
30  KitchenQual          4       5
31  Functional           7       8
32  FireplaceQu          6       6
33  GarageType           7       7
34  GarageFinish         4       4
35  GarageQual           6       5
36  GarageCond           6       6
37  PavedDrive           3       3
38  PoolQC               4       3
39  Fence                5       5
40  MiscFeature          5       4
41  SaleType             9      10
42  SaleCondition        6       6
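
Level counts alone do not show which categories differ, and two sets can have the same count but different categories. As a rough sketch, the hypothetical helper `level_differences` below lists the categories that appear in only one of the two sets. The helper name and the toy values are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv / test.csv.
train = pd.DataFrame({"Heating": ["GasA", "GasW", "Wall", "Floor"],
                      "Street": ["Pave", "Grvl", "Pave", "Pave"]})
test = pd.DataFrame({"Heating": ["GasA", "GasW", "GasA", "Grav"],
                     "Street": ["Pave", "Grvl", "Pave", "Pave"]})


def level_differences(train, test, columns):
    """List categories that occur in only one of the two frames."""
    rows = []
    for col in columns:
        train_levels = set(train[col].dropna().unique())
        test_levels = set(test[col].dropna().unique())
        rows.append({
            "var": col,
            "only_in_train": sorted(train_levels - test_levels),
            "only_in_test": sorted(test_levels - train_levels),
        })
    return pd.DataFrame(rows)


diff = level_differences(train, test, ["Heating", "Street"])
print(diff)
```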

You need to encode the test-set variables using the same category definitions as the training set; see the examples here
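
The link target of that reference is not preserved here, so the following is only one possible way to apply the same category definitions, sketched with pandas `CategoricalDtype`; it is not necessarily what the linked examples show. The toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: the test set lacks "Metal" and "WdShake"
# and contains "Tar&Grv", which the training set never saw.
train = pd.DataFrame({"RoofMatl": ["CompShg", "Metal", "WdShake", "CompShg"]})
test = pd.DataFrame({"RoofMatl": ["CompShg", "Tar&Grv"]})

# Encoding each set on its own yields different columns.
naive_train = pd.get_dummies(train["RoofMatl"])
naive_test = pd.get_dummies(test["RoofMatl"])

# Define the categories once, from the training set, and reuse them.
levels = sorted(train["RoofMatl"].dropna().unique())
roof_dtype = pd.CategoricalDtype(categories=levels)

# Values outside `levels` become NaN and encode as an all-zero row.
d_train = pd.get_dummies(train["RoofMatl"].astype(roof_dtype))
d_test = pd.get_dummies(test["RoofMatl"].astype(roof_dtype))

print(list(naive_train.columns), list(naive_test.columns))
print(list(d_train.columns), list(d_test.columns))
```

With a shared dtype, both encoded frames have one column per training category, in the same order. A `OneHotEncoder` fitted on the training set and then used to `transform` the test set gives the same guarantee within sklearn.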
