How to correctly stack sklearn pipelines with custom selectors

Asked 2025-04-14 17:56

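I want to build an ensemble model with sklearn's StackingClassifier class.

Specifically, I have two different kinds of data: 2D images and conventional tabular data. I want an MLP classifier to handle the 2D images and a random forest classifier to handle the tabular data. I then want a random forest classifier to combine the predictions of these two models into my final prediction.

Because StackingClassifier has no way to give each base estimator different training data, I came up with a workaround: I created two Pipelines, each starting with a custom selector that returns a value from a dictionary. I can fit and predict with the two pipelines individually, but when I try to fit and predict with the stacked model, I get ValueError: Found input variables with inconsistent numbers of samples.

Here is the code I am using: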

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

#Customized Selector 
class DictionarySelector(BaseEstimator, TransformerMixin):
    def __init__(self, in_dict, key):

        self.in_dict = in_dict
        self.key = key
        
    def fit(self, x, y = None):
        return(self)
    
    def transform(self, key):
        return(self.in_dict[self.key])
    
    
        

# Generate some example data 
# For demonstration purposes, I'm using the Iris dataset
data = load_iris()
X_images = data.data[:, :2]  # Let's say that these are my images
X_vector = data.data[:, 2:]  # Unidimensional vector 
y = data.target  # Target classes (3 classes)

# Split the data into train and test sets
X_images_train, X_images_test, X_vector_train, X_vector_test, y_train, y_test = train_test_split(
    X_images, X_vector, y, test_size=0.2, random_state=42
)

# Define the MLP for images
image_model = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu", max_iter=1000)

# Define the Random Forest for vectors
vector_model = RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=None)
#Create dict with the two data types
in_dict = {'images' : X_images_train, 'vector' : X_vector_train}
#MLP pipeline for images
pipe_images = Pipeline([
    ('select', DictionarySelector(in_dict, 'images')),
    ('clf', image_model)
])

#check: it works
print(pipe_images.fit(X_images_train, y_train).predict(X_images_test))

#features pipeline
pipe_vector = Pipeline([
    ('select', DictionarySelector(in_dict, 'vector')),
    ('clf', vector_model)
])
#check: it works
print(pipe_vector.fit(X_vector_train, y_train).predict(X_vector_test))

# Create a stacked model
stacked_model = StackingClassifier(
    estimators=[
        ("image_mlp", pipe_images),
        ("vector_rf", pipe_vector),
    ],
    final_estimator=RandomForestClassifier(n_estimators=100),
)

# The stacked model throws the ValueError
stacked_model.fit(in_dict, y_train)

1 Answer

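You can combine the tabular data and the image data into one large matrix of shape n_samples x (tabular features + image features). Then create two pipelines, one for the tabular data and one for the image data. That way you can pass the same data to both pipelines; each one selects the columns it needs internally and then classifies. I have modified the code to work this way.

The steps are: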

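  • The image data is arranged into an n_samples x n_pixels matrix
  • The tabular data is arranged into an n_samples x n_features matrix
  • The two matrices are concatenated into one wider matrix that contains all the columns
  • The tabular pipeline is [tabular column selector -> tabular classifier]. The selector is a ColumnTransformer that keeps the tabular columns and drops the rest. The image pipeline is defined in the same way.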
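
The column-selection step can be tried in isolation. Below is a minimal sketch on made-up data: the sizes (5 samples, 4 pixel columns, 3 tabular columns) and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer

# Toy stand-ins (hypothetical sizes): 5 samples, 4 "pixel" columns, 3 tabular columns
n_samples, n_pixels, n_features = 5, 4, 3
X_images_flat = np.zeros((n_samples, n_pixels))
X_tabular = np.ones((n_samples, n_features))

# One wide matrix: image columns first, tabular columns after them
X_wide = np.concatenate([X_images_flat, X_tabular], axis=1)
tabular_cols = np.arange(n_pixels, n_pixels + n_features)

# Keep only the tabular columns; every other column is dropped
tabular_selector = ColumnTransformer(
    [("tabular", "passthrough", tabular_cols)],
    remainder="drop",
)
X_selected = tabular_selector.fit_transform(X_wide)

print(X_wide.shape)      # wide matrix with all columns
print(X_selected.shape)  # only the tabular columns remain
```

Because the selector works on whatever matrix it is given, the same pipeline can be fitted on training rows and applied to test rows or cross-validation folds.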

The tabular pipeline, the image pipeline, and the stacking classifier all take the same inputs (X_train, y_train).
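
Passing one matrix instead of a dict is likely what avoids the ValueError from the question. A dict's length is its number of keys (2 here), not its number of samples, so a length check against the 120 training labels fails. The sketch below assumes that scikit-learn's public helper sklearn.utils.check_consistent_length does the same kind of check as the stacking code runs internally; the exact internal call path is not confirmed here, and the zero-filled arrays are placeholders.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Placeholders with the same sizes as the question's training split
y_train = np.zeros(120)
in_dict = {
    "images": np.zeros((120, 2)),
    "vector": np.zeros((120, 2)),
}

# A dict reports its number of keys, not its number of samples
print(len(in_dict), len(y_train))

error_message = None
try:
    check_consistent_length(in_dict, y_train)
except ValueError as err:
    error_message = str(err)

print(error_message)
```

A single 2D array reports its number of rows, so the lengths of X and y agree.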

Output of the code at the end of this answer:

Checking pipe_image | score=0.900
  [1 0 2 1 2 0 1 2 1 1 2 0 0 0 0 2 2 1 1 2 0 1 0 2 2 2 2 2 0 0] 

Checking pipe_tabular | score=1.000
  [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0] 

Stacked accuracy: 1.000

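[Figure: diagram of the stacked model (image not included). The orange boxes are the individual pipelines for the tabular and image data, and the blue outline is the stacking classifier.]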


import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

np.random.seed(0)

# Generate some example data 
# For demonstration purposes, I'm using the Iris dataset
data = load_iris()
X_images = data.data[:, :2]  # Let's say that these are my images
X_images_flat = X_images.reshape(len(X_images), -1) #flatten each sample from 2D to 1D if required

X_tabular = data.data[:, 2:]  # Unidimensional vector of tabular data 
y = data.target  # Target classes (3 classes)

X_images_and_tabular = np.concatenate([X_images_flat, X_tabular], axis=1)
images_col_indices = np.arange(X_images_flat.shape[1])
tabular_col_indices = np.arange(X_images_flat.shape[1], X_images_and_tabular.shape[1])

#
# Make 2 column selectors
#  Each transformer will "passthrough" the image or tabular columns and
#   "drop" the remaining columns, depending on whether it is selecting the
#   image data or tabular data
#
from sklearn.compose import ColumnTransformer

image_columns_selector = ColumnTransformer(
    [('image_columns_selector', 'passthrough', images_col_indices)],
    remainder='drop'
)

tabular_columns_selector = ColumnTransformer(
    [('tabular_columns_selector', 'passthrough', tabular_col_indices)],
    remainder='drop'
)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_images_and_tabular, y, test_size=0.2, random_state=42
)

# Define the MLP for images
image_model = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu", max_iter=3000)

# Define the Random Forest for tabular data
tabular_model = RandomForestClassifier(n_estimators=100, criterion="gini")

#MLP pipeline for images
from sklearn.pipeline import make_pipeline
pipe_image = make_pipeline(image_columns_selector, image_model)
image_pipe_acc = accuracy_score(
    y_test,
    pipe_image.fit(X_train, y_train).predict(X_test)
)
print(f'Checking pipe_image | score={image_pipe_acc:.3f}\n ', pipe_image.predict(X_test), '\n')

# Random forest pipeline for tabular data
pipe_tabular = make_pipeline(tabular_columns_selector, tabular_model)
tabular_pipe_acc = accuracy_score(
    y_test,
    pipe_tabular.fit(X_train, y_train).predict(X_test)
)
print(f'Checking pipe_tabular | score={tabular_pipe_acc:.3f}\n ', pipe_tabular.predict(X_test), '\n')

# Create a stacked model
stacked_model = StackingClassifier(
    estimators=[
        ("image_mlp", pipe_image),
        ("tabular_rf", pipe_tabular),
    ],
    final_estimator=RandomForestClassifier(n_estimators=100),
)

# Fit the stacked model on the combined matrix (no ValueError this time)
stacked_model.fit(X_train, y_train)
print(
    'Stacked accuracy:',
    '%5.3f' % stacked_model.score(X_test, y_test)
)
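
If you would rather keep a custom selector than use ColumnTransformer, the key point is that transform must slice the X it is handed instead of returning stored data. Only then does it work on the test set and on the cross-validation folds that StackingClassifier creates. This is a minimal sketch under assumptions: ColumnSliceSelector is a hypothetical class name, LogisticRegression is a lightweight stand-in for the MLP, and the Iris columns again stand in for image and tabular features.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


class ColumnSliceSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector: returns the given columns of whatever X it receives."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the column indices are fixed at construction
        return self

    def transform(self, X):
        # Slice the incoming X so train, test and CV folds are all handled
        return np.asarray(X)[:, self.columns]


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Columns 0-1 play the role of flattened images, columns 2-3 the tabular data
pipe_image = make_pipeline(
    ColumnSliceSelector([0, 1]),
    LogisticRegression(max_iter=1000),  # stand-in for the MLP (assumption)
)
pipe_tabular = make_pipeline(
    ColumnSliceSelector([2, 3]),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

stacked_model = StackingClassifier(
    estimators=[("image", pipe_image), ("tabular", pipe_tabular)],
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
)
stacked_model.fit(X_train, y_train)

# Meta-features seen by the final estimator: class probabilities per base model
meta_features = stacked_model.transform(X_test)
print(meta_features.shape)
print(f"Stacked accuracy: {stacked_model.score(X_test, y_test):.3f}")
```

stacked_model.transform returns the base models' class probabilities side by side, which is what the final estimator is trained on. With two base models and three classes that gives six columns.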
