How to use a custom feature selection function in scikit-learn's `pipeline`

Asked 2025-04-18 16:52

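Suppose I want to compare different dimensionality reduction approaches on a supervised learning dataset with more than two features (n>2), using cross-validation and the Pipeline class.

For example, to try principal component analysis (PCA) and linear discriminant analysis (LDA), I could do something like this: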

# Imports for current scikit-learn releases; older releases exposed these
# as sklearn.cross_validation and sklearn.lda.LDA.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

# X_train and y_train are assumed to be defined already.

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
    ])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
    ])

# Constructing the k-fold cross validation iterator (k=10)

cv = KFold(n_splits=10,          # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_pca, clf_lda]
    ]

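But now suppose that, based on some "domain knowledge", I suspect that features 3 and 4 might be "good features" (that is, the third and fourth columns of the array X_train), and I want to compare them against the other approaches.

How would I include such a manual feature selection in the pipeline?

For example: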

def select_3_and_4(X_train):
    return X_train[:,2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),           
    ('classification', GaussianNB())   
    ])

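Obviously this doesn't work.

So I assume I have to create a feature selection class that has a dummy transform method and a fit method, and that returns the two columns of the numpy array? Or is there a better way?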

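Tags: custom class, data preprocessing, feature selection, cross-validation, principal component analysis, linear discriminant analysis, supervised learning, dimensionality reduction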

6 Answers

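I didn't find the earlier answers very clear, so here is my solution for reference.

Basically, the idea is to create a new class based on BaseEstimator and TransformerMixin.

The following is a feature selector that works on the share of missing values (NAs) in each column. perc is the missing-value threshold, expressed as a fraction (0.1 means 10%).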

from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):

    """Extract columns with less than x percentage NA to impute further
    in the line
    Class to use in the pipeline
    -----
    attributes
    fit : identify columns - in the training set
    transform : only use those columns
    """

    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
        # X is expected to be a pandas DataFrame
        na_fraction = X.isna().mean()  # fraction of NAs per column
        # Filter with the boolean mask; taking .index of the mask itself
        # would return every column, not only the ones below the threshold
        self.columns_with_less_than_x_na_id = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    # get_params and set_params are inherited from BaseEstimator
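The column-filtering step that this selector relies on can be checked in isolation. The snippet below is only a minimal sketch on a made-up DataFrame (the column names, values, and the 0.3 threshold are assumptions for illustration, not part of the answer): it computes the fraction of NAs per column and keeps the columns below the threshold.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column "b" is 50% missing, column "c" is 25% missing.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [np.nan, 5.0, 6.0, 7.0],
})

perc = 0.3                      # keep columns with less than 30% missing values
na_fraction = X.isna().mean()   # per-column fraction of NaNs
kept = na_fraction[na_fraction < perc].index.tolist()
X_kept = X[kept]

print(kept)
```

Indexing the Series with its own boolean mask is what restricts the result to the qualifying columns.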

Answered 2025-04-18 by Python大师

You can use the following custom transformer to select the columns you specify:

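#Custom transformer that extracts the columns passed as an argument to its constructor

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector( BaseEstimator, TransformerMixin ):

    #Class Constructor
    def __init__( self, feature_names ):
        #Stored under the same name as the argument so get_params() and clone() work
        self.feature_names = feature_names

    #Return self nothing else to do here
    def fit( self, X, y = None ):
        return self

    #Method that describes what we need this transformer to do
    def transform( self, X, y = None ):
        #X is expected to be a pandas DataFrame
        return X[ self.feature_names ]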

Here feature_names is the list of features you want to select. For more details, see this link:

https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65

Answered 2025-04-18 by Python大师

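Adding to the answers by Sebastian Raschka and eickenberg: the requirements a transformer object should meet are specified in the scikit-learn documentation.

If you want the estimator to be usable in parameter search, there are several more requirements than just having fit and transform, such as implementing set_params.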
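One common way to meet the get_params/set_params requirement is to inherit from BaseEstimator, which derives both methods from the constructor signature. The following is a minimal sketch under stated assumptions: ColumnSelector is a hypothetical class name, the iris dataset stands in for the asker's X_train and y_train, and the two candidate column pairs are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that keeps the columns listed in `cols`."""

    def __init__(self, cols=(2, 3)):
        # Store the argument unchanged under the same name so that the
        # inherited get_params/set_params can find it.
        self.cols = cols

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return np.asarray(X)[:, list(self.cols)]


# Stand-in data (assumption): iris instead of the asker's dataset.
X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', ColumnSelector()),
    ('classification', GaussianNB()),
])

# "<step name>__<parameter>" addresses the selector's parameter in the search.
search = GridSearchCV(
    pipe,
    param_grid={'feature_select__cols': [(0, 1), (2, 3)]},
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)
print(search.best_params_)
```

Because `cols` is a regular constructor parameter, the manual column choice becomes something the grid search can vary like any other hyperparameter.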

Answered 2025-04-18 by Python大师

I just wanted to share my solution, which may be useful to someone:

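from sklearn.base import BaseEstimator, TransformerMixin

# Inheriting from BaseEstimator supplies get_params/set_params
class ColumnExtractor(BaseEstimator, TransformerMixin):

    def transform(self, X):
        cols = X[:,2:4] # column 3 and 4 are "extracted"
        return cols

    def fit(self, X, y=None):
        return self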

It can then be used in a Pipeline like this:

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),           
    ('classification', GaussianNB())   
    ])

Edit: general solution

If you want to select and stack multiple columns, here is a more general solution using the following class:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        # pick the requested columns and stack them side by side
        return np.asarray(X)[:, list(self.cols)]

    def fit(self, X, y=None):
        return self

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1,3))),   # selects the second and 4th column
    ('classification', GaussianNB())
    ])

Answered 2025-04-18 by Python大师

If you want to use the Pipeline object, then the ideal way is to write a transformer object. A rather "dirty" way to do it is

select_3_and_4.transform = select_3_and_4.__call__
# fit is called with both X and y, so the stub must accept them
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and then use select_3_and_4 in the pipeline as you did. You can obviously also write a class for this.
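A less hacky route than attaching methods to the function, not mentioned elsewhere in this thread, is scikit-learn's FunctionTransformer, which wraps a plain function as a stateless transformer. The sketch below assumes synthetic stand-in data (X_demo and y_demo are made up for illustration) and reuses the select_3_and_4 function from the question.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_3_and_4(X):
    # Third and fourth columns (zero-based indices 2 and 3).
    return X[:, 2:4]


clf_manual = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', FunctionTransformer(select_3_and_4)),
    ('classification', GaussianNB()),
])

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 5))
y_demo = (X_demo[:, 2] + X_demo[:, 3] > 0).astype(int)

clf_manual.fit(X_demo, y_demo)

# Run only the preprocessing steps to inspect what reaches the classifier.
selected = clf_manual[:-1].transform(X_demo)
print(selected.shape)
```

The resulting pipeline can be passed to cross_val_score alongside the PCA and LDA pipelines, since it exposes the same estimator interface.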

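Alternatively, if you know the other features are irrelevant, you can simply pass X_train[:, 2:4] to your pipeline directly.

Data-driven feature selection tools are perhaps a bit off-topic, but always useful: for example, take a look at sklearn.feature_selection.SelectKBest, combined with sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression, with k=2 in your case.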
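The SelectKBest suggestion can be sketched as follows. This is an illustration only: the iris dataset stands in for the asker's data, and the pipeline name clf_kbest is made up for the example. Because the selector sits inside the pipeline, the two features are re-selected on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): iris instead of X_train / y_train.
X, y = load_iris(return_X_y=True)

clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', SelectKBest(score_func=f_classif, k=2)),
    ('classification', GaussianNB()),
])

# 10-fold cross-validation, mirroring the setup in the question.
scores = cross_val_score(clf_kbest, X, y, cv=10, scoring='accuracy')
print(scores.mean())

# Fit on all data to see which column indices were kept.
clf_kbest.fit(X, y)
kept = clf_kbest.named_steps['feature_select'].get_support(indices=True)
print(kept)
```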

Answered 2025-04-18 by Python大师
