Python pipeline returns NaN scores when used in cross-validation

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_validate
import numpy as np
import pandas as pd

# Create random dataframe
num_data = np.random.random_sample((5,4))
cat_data = ['good','bad','fair','excellent','bad']
col_list_stack = ['SalePrice','Id','TotalBsmtSF','GrdLivArea']

data = pd.DataFrame(num_data, columns = col_list_stack)
data['Quality'] = cat_data

X_train = data.drop(labels = ['SalePrice'], axis = 1)
y_train = data['SalePrice']

#------------------------------------------------------------#
# create a custom transformer to remove columns
class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, skip = False, remove_cols = ['Id','TotalBsmtSF']):
        self._remove_cols = remove_cols
        self._skip = skip

    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        if not self._skip:
            return X.drop(labels = self._remove_cols,axis = 1)
        else:
            return X

#------------------------------------------------------------#
# PIPELINE and cross-validation
# Preprocessing steps common to numerical and categorical data
preprocessor_common = Pipeline(steps=[
    ('remove_features', ColumnsRemoval())])

# Separated preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor_by_cat = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['GrdLivArea']),
        ('cat', categorical_transformer, ['Quality'])],
    remainder = 'passthrough')

# Full pipeline with model
pipe = Pipeline(steps = [('preprocessor_common', preprocessor_common),
                         ('preprocessor_by_cat', preprocessor_by_cat),
                         ('model', LinearRegression())])

# Use cross validation to obtain scores
scores = cross_validate(pipe, X_train, y_train,
                        scoring = ["neg_mean_squared_error","r2"], cv = 4)

pipe = Pipeline(steps = [('preprocessor_common', preprocessor_common),
                         ('preprocessor_by_cat', preprocessor_by_cat),
                        ])

X_processed = pipe.fit_transform(X_train)

# Use cross validation to obtain scores
scores = cross_validate(LinearRegression(), X_processed, y_train,
                        scoring = ["neg_mean_squared_error","r2"], cv = 4)

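2 Answers

User

#1 · Edited 2024-04-26 12:45:31

TL;DR

You need to redefine the __init__() method of your custom ColumnsRemoval, because passing a Python list as the default value leads to an error. One possible fix: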

class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, skip=False, remove_cols=None):
        if remove_cols is None:
            remove_cols = ['Id', 'TotalBsmtSF']
        self._remove_cols = remove_cols
        self._skip = skip

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if not self._skip:
            return X.drop(labels=self._remove_cols, axis=1)
        else:
            return X

With this change, your pipeline should work as expected.

Background

I ran your MWE and got the following warning:

FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan.
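
Because this warning hides the underlying exception, it can help to make cross_validate re-raise it instead of recording NaN. The following is a minimal sketch, assuming scikit-learn and numpy are installed; AlwaysFails is a made-up estimator used only to trigger a failure, and the data is arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_validate


class AlwaysFails(RegressorMixin, BaseEstimator):
    """Hypothetical estimator whose fit() always raises, for illustration only."""

    def fit(self, X, y):
        raise ValueError("fit failed on purpose")

    def predict(self, X):
        return np.zeros(len(X))


# Arbitrary toy data: 10 samples, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

try:
    # error_score="raise" propagates the exception from fit()
    # instead of setting the score of the failed split to NaN.
    cross_validate(AlwaysFails(), X, y, cv=2, error_score="raise")
    surfaced = None
except ValueError as exc:
    surfaced = str(exc)

print(surfaced)
```

Passing error_score="raise" to the cross_validate call in the question should surface the real traceback in the same way.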

The failure comes from the following line of your custom ColumnsRemoval:

return X.drop(labels=self._remove_cols, axis=1)

which raises this error:

ValueError: Need to specify at least one of 'labels', 'index' or 'columns'

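This seems to be a known issue when passing a standard Python list to drop(), and it is discussed in this post. The solution given there is to pass a numpy array or a pandas Index object instead. The alternative I propose is to not set the default value of remove_cols in the function signature, but to assign it in the function body. That works as well.

Nobody seems to really know why this happens. I am sorry that I cannot explain the actual cause in detail (I would be glad if someone could add it), but the problem should be solved.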

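User

#2 · Edited 2024-04-26 12:45:31

I found the problem. I ran some further tests, also using a float instead of a list as the default value.

As described here, under the section on instantiation:

the object's attributes used in __init__() should have exactly the name of the argument in the constructor.

So what I did was give the object attributes the same names as the arguments passed to __init__(), and now everything works. For example: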

class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = 0.9):
        self.threshold = threshold
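
The reason the names matter can be sketched as follows, assuming scikit-learn is installed. cross_validate clones the estimator before each fit, and clone() rebuilds it from get_params(), which reads attributes named after the constructor arguments. In this sketch the fallback column list lives in transform() rather than in __init__(); that is one possible layout and not the code from either answer.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ColumnsRemoval(BaseEstimator, TransformerMixin):
    def __init__(self, skip=False, remove_cols=None):
        # Attributes carry exactly the constructor argument names,
        # and the arguments are stored without modification.
        self.skip = skip
        self.remove_cols = remove_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.skip:
            return X
        # Fallback applied here (an assumption of this sketch),
        # so that __init__ stays free of any logic.
        cols = ['Id', 'TotalBsmtSF'] if self.remove_cols is None else self.remove_cols
        return X.drop(labels=cols, axis=1)


original = ColumnsRemoval(skip=True, remove_cols=['Id'])
cloned = clone(original)  # what cross_validate does before fitting

print(cloned.get_params())
```

The clone carries the same skip and remove_cols values as the original, so the transformer behaves the same inside and outside cross-validation.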

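Using self._threshold (note the _ before threshold) behaves strangely: in some cases the object is used with the supplied (or default) value, but in others self._threshold is set to None. Matching the names also makes it possible to pass a list as the default value through __init__() (although a list should be avoided as a default value; see afsharov's answer for details).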
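
An attribute that ends up as None would also be consistent with the ValueError quoted in the first answer. Here is a quick check, assuming pandas is installed and using a made-up two-column DataFrame: a plain Python list of labels is accepted by drop(), whereas labels=None is rejected.

```python
import pandas as pd

# Hypothetical frame, only for illustration.
df = pd.DataFrame({"Id": [1, 2], "GrdLivArea": [0.5, 0.7]})

# A plain Python list of labels works.
remaining = list(df.drop(labels=["Id"], axis=1).columns)

# labels=None (what the transformer would pass if its attribute were None) fails.
try:
    df.drop(labels=None, axis=1)
    message = None
except ValueError as exc:
    message = str(exc)

print(remaining)
print(message)
```

If this holds for the pandas version in use, the list default itself is probably not what drop() objects to; the None value is.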
