How to correctly stack sklearn pipelines with a custom selector
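I want to use sklearn's StackingClassifier class to build an ensemble model.
Specifically, I have two different kinds of data: 2D images and conventional tabular data. I want an MLP classifier to handle the 2D images and a random forest classifier to handle the tabular data. Then I want a random forest classifier to combine the predictions of these two models into my final prediction.
Because StackingClassifier has no way to pass different training data to each estimator, I came up with a workaround: I created two Pipeline objects, each starting with a custom selector that returns a value from a dictionary. I can fit and predict with each pipeline on its own, but when I try to fit and predict with the stacked model I get ValueError: Found input variables with inconsistent numbers of samples.
Here is the code I used: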
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Customized Selector
class DictionarySelector(BaseEstimator, TransformerMixin):
    def __init__(self, in_dict, key):
        self.in_dict = in_dict
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, key):
        return self.in_dict[self.key]
# Generate some example data
# For demonstration purposes, I'm using the Iris dataset
data = load_iris()
X_images = data.data[:, :2] # Let's say that these are my images
X_vector = data.data[:, 2:] # Unidimensional vector
y = data.target # Target classes (3 classes)
# Split the data into train and test sets
X_images_train, X_images_test, X_vector_train, X_vector_test, y_train, y_test = train_test_split(
    X_images, X_vector, y, test_size=0.2, random_state=42
)
# Define the MLP for images
image_model = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu", max_iter=1000)
# Define the Random Forest for vectors
vector_model = RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=None)
#Create dict with the two data types
in_dict = {'images' : X_images_train, 'vector' : X_vector_train}
#MLP pipeline for images
pipe_images = Pipeline([
    ('select', DictionarySelector(in_dict, 'images')),
    ('clf', image_model)
])
#check: it works
print(pipe_images.fit(X_images_train, y_train).predict(X_images_test))
#features pipeline
pipe_vector = Pipeline([
    ('select', DictionarySelector(in_dict, 'vector')),
    ('clf', vector_model)
])
#check: it works
print(pipe_vector.fit(X_vector_train, y_train).predict(X_vector_test))
# Create a stacked model
stacked_model = StackingClassifier(
    estimators=[
        ("image_mlp", pipe_images),
        ("vector_rf", pipe_vector),
    ],
    final_estimator=RandomForestClassifier(n_estimators=100),
)
# The stacked model throws the ValueError
stacked_model.fit(in_dict, y_train)
1 Answer
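You can merge the tabular data and the image data into one wide matrix of shape n_samples x (tabular features + image features). Then create two pipelines, one for the tabular data and one for the image data. That way you can feed the same data to both pipelines; each one selects the columns it needs internally and then classifies. I have modified your code to work this way.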
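As background on why the dictionary-based selector breaks down, here is a minimal sketch. It is not part of the original code: the random arrays are stand-ins for the question's data, and my reading of where the exception is raised inside StackingClassifier is an assumption. It illustrates two properties that look like the likely cause: the selector ignores the rows it is handed, so it cannot follow cross-validation splits, and a dict reports its number of keys as its length, which does not match the number of labels.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_consistent_length


class DictionarySelector(BaseEstimator, TransformerMixin):
    """Selector from the question: returns a stored array and ignores its input."""

    def __init__(self, in_dict, key):
        self.in_dict = in_dict
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is never used, so any row subset passed in is silently discarded
        return self.in_dict[self.key]


# Stand-in data (assumption): 120 samples, two blocks of 2 features each
rng = np.random.default_rng(0)
in_dict = {"images": rng.normal(size=(120, 2)), "vector": rng.normal(size=(120, 2))}
y = rng.integers(0, 3, size=120)

selector = DictionarySelector(in_dict, "images")

# A cross-validation fold passes only some of the rows,
# yet the selector still returns all 120 stored rows.
fold_rows = in_dict["images"][:96]
selected = selector.fit(fold_rows).transform(fold_rows)
print(selected.shape)  # (120, 2)

# len(in_dict) is the number of keys (2), not the number of samples (120),
# so sklearn's length check rejects the (dict, y) pair.
try:
    check_consistent_length(in_dict, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Because the stacking classifier builds its meta-features from cross-validated predictions, every base estimator has to respect the row subset it receives, which is what selecting columns from one shared matrix achieves.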
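The steps are as follows:
- The image data is arranged into an n_samples x n_pixels matrix
- The tabular data is arranged into an n_samples x n_features matrix
- The two matrices are concatenated into one wider matrix that contains all the columns
- The tabular pipeline is [tabular column selector -> tabular classifier]. The selector is a ColumnTransformer that keeps the tabular columns and drops the rest. The image pipeline is defined the same way.
The tabular pipeline, the image pipeline, and the stacking classifier are all fitted on the same input (X_train, y_train).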
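The steps above can be sketched in isolation as follows. The 8x8 random "images" and the 3 tabular columns are hypothetical placeholders chosen only to make the shapes visible; the point is that a ColumnTransformer selects columns from whatever rows it receives, so it stays consistent with any row subset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer

rng = np.random.default_rng(0)
n_samples = 10
images = rng.random((n_samples, 8, 8))  # hypothetical 8x8 images
tabular = rng.random((n_samples, 3))    # hypothetical tabular features

# Flatten each image to one row, then place both blocks side by side
images_flat = images.reshape(n_samples, -1)         # shape (10, 64)
X = np.concatenate([images_flat, tabular], axis=1)  # shape (10, 67)

# Remember which columns belong to which block
image_cols = np.arange(images_flat.shape[1])
tabular_cols = np.arange(images_flat.shape[1], X.shape[1])

# Keep only the tabular columns and drop everything else
tabular_selector = ColumnTransformer(
    [("tabular", "passthrough", tabular_cols)],
    remainder="drop",
)

# The selector works on any row subset, e.g. a cross-validation fold
subset = X[:4]
selected = tabular_selector.fit_transform(subset)
print(selected.shape)  # (4, 3)
```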
Output:
Checking pipe_image | score=0.900
[1 0 2 1 2 0 1 2 1 1 2 0 0 0 0 2 2 1 1 2 0 1 0 2 2 2 2 2 0 0]
Checking pipe_tabular | score=1.000
[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Stacked accuracy: 1.000
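(Estimator diagram: the orange boxes are the separate pipelines for the tabular and image data, and the blue outline is the stacking classifier.)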
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
np.random.seed(0)
# Generate some example data
# For demonstration purposes, I'm using the Iris dataset
data = load_iris()
X_images = data.data[:, :2] # Let's say that these are my images
X_images_flat = X_images.reshape(len(X_images), -1) #flatten each sample from 2D to 1D if required
X_tabular = data.data[:, 2:] # Unidimensional vector of tabular data
y = data.target # Target classes (3 classes)
X_images_and_tabular = np.concatenate([X_images_flat, X_tabular], axis=1)
images_col_indices = np.arange(X_images_flat.shape[1])
tabular_col_indices = np.arange(X_images_flat.shape[1], X_images_and_tabular.shape[1])
#
# Make 2 column selectors
# Each transformer will "passthrough" the image or tabular columns and
# "drop" the remaining columns, depending on whether it is selecting the
# image data or tabular data
#
from sklearn.compose import ColumnTransformer
image_columns_selector = ColumnTransformer(
    [('image_columns_selector', 'passthrough', images_col_indices)],
    remainder='drop'
)
tabular_columns_selector = ColumnTransformer(
    [('tabular_columns_selector', 'passthrough', tabular_col_indices)],
    remainder='drop'
)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_images_and_tabular, y, test_size=0.2, random_state=42
)
# Define the MLP for images
image_model = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu", max_iter=3000)
# Define the Random Forest for tabular data
tabular_model = RandomForestClassifier(n_estimators=100, criterion="gini")
#MLP pipeline for images
from sklearn.pipeline import make_pipeline
pipe_image = make_pipeline(image_columns_selector, image_model)
image_pipe_acc = accuracy_score(
    y_test,
    pipe_image.fit(X_train, y_train).predict(X_test)
)
print(f'Checking pipe_image | score={image_pipe_acc:.3f}\n ', pipe_image.predict(X_test), '\n')
#features pipeline
pipe_tabular = make_pipeline(tabular_columns_selector, tabular_model)
tabular_pipe_acc = accuracy_score(
    y_test,
    pipe_tabular.fit(X_train, y_train).predict(X_test)
)
print(f'Checking pipe_tabular | score={tabular_pipe_acc:.3f}\n ', pipe_tabular.predict(X_test), '\n')
# Create a stacked model
stacked_model = StackingClassifier(
    estimators=[
        ("image_mlp", pipe_image),
        ("tabular_rf", pipe_tabular),
    ],
    final_estimator=RandomForestClassifier(n_estimators=100),
)
# Fit the stacked model on the combined matrix; this no longer raises the ValueError
stacked_model.fit(X_train, y_train)
print(
    'Stacked accuracy:',
    '%5.3f' % stacked_model.score(X_test, y_test)
)
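To see what the final estimator actually receives, one option is to call transform on the fitted stacking classifier. The sketch below is a condensed variant of the code above, not a replacement for it: the helper column_pipeline is a hypothetical name, and the smaller models and fixed random_state values are my own choices to keep it quick. With the default stack_method, each base pipeline contributes its class probabilities, so three classes and two pipelines give six meta-feature columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def column_pipeline(columns, model):
    """Hypothetical helper: select the given columns, then classify."""
    selector = ColumnTransformer(
        [("select", "passthrough", columns)], remainder="drop"
    )
    return make_pipeline(selector, model)


# Columns 0-1 play the role of the images, columns 2-3 the tabular data.
# Model sizes are reduced here for speed (assumption, not the original settings).
stacked = StackingClassifier(
    estimators=[
        ("image_mlp", column_pipeline(
            [0, 1],
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
        )),
        ("tabular_rf", column_pipeline(
            [2, 3],
            RandomForestClassifier(n_estimators=50, random_state=0),
        )),
    ],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
)
stacked.fit(X_train, y_train)

# One block of class probabilities per base pipeline
meta_features = stacked.transform(X_test)
print(meta_features.shape)  # (30, 6)
```

The first three columns come from the image pipeline and the last three from the tabular pipeline, which makes it easy to check how much each one contributes to the final prediction.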