Using GroupKFold in nested cross-validation with sklearn

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, GroupKFold
import numpy as np

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100], "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, groups=y_iris)

.../anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: ValueError: The 'groups' parameter should not be None.

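3 Answers

User

#1 · Edited 2024-04-19 13:17:38

For anyone coming back to this now who, like me, wants to pass GroupKFold cross-validation to cross_val_score():

cross_val_score() accepts cv=GroupKFold() and the groups parameter as separate arguments.

That is what I was trying to achieve.

For example: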

cv_outer = GroupKFold(n_splits=n_unique_groups)
groups = X['your_group_name'] # or pass your group another way

# ... ML code that defines the estimator `search` ...

scores = cross_val_score(search, X, y, scoring='f1', cv=cv_outer, groups=groups)
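
As a minimal runnable sketch of that call: the synthetic data, the `groups` array, and the choice of LogisticRegression below are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic data; 'groups' is a hypothetical label per sample (6 groups of 20 samples).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
groups = np.repeat(np.arange(6), 20)

cv_outer = GroupKFold(n_splits=3)
model = LogisticRegression(max_iter=1000)

# cv and groups are separate arguments; groups is handed to cv_outer.split().
scores = cross_val_score(model, X, y, scoring="f1", cv=cv_outer, groups=groups)
print(scores)
```

Note that this supplies groups only to the outer splitter. If the estimator were itself a search whose cv is a GroupKFold, the inner splits would still need their own groups, which appears to be the failure shown in the question.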

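User

#2 · Edited 2024-04-19 13:17:38

I ran into a similar problem and found @Samalama's solution worked well. The only thing I had to change was the fit call: I also had to slice groups so that it has the same length as the training-set X and y. Otherwise I get an error saying the three objects do not have the same shape. Is this the correct implementation?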

for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                                param_distributions=parameters_grid,
                                cv=inner_cv,
                                scoring=get_scoring(),
                                refit='roc_auc_scorer',
                                return_train_score=True,
                                verbose=1,
                                n_jobs=jobs)
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)
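
The slicing step can be checked in isolation. The sketch below uses toy arrays (all values are made up for illustration) and only the splitters, with no model: it confirms that the sliced groups match the training fold in length, that passing the full groups array is rejected, and that the inner splits keep groups apart.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 samples in 6 hypothetical groups of 2 samples each.
x = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat(np.arange(6), 2)

outer_cv = GroupKFold(n_splits=3)
inner_cv = GroupKFold(n_splits=2)

mismatch_rejected = []
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, y_train = x[train_index], y[train_index]
    groups_train = groups[train_index]  # slice groups like x and y
    assert len(groups_train) == len(x_train) == len(y_train)

    # Passing the full groups array with the training fold has the wrong length.
    try:
        list(inner_cv.split(x_train, y_train, groups=groups))
        mismatch_rejected.append(False)
    except ValueError:
        mismatch_rejected.append(True)

    # With the sliced groups, no group appears on both sides of an inner split.
    for inner_train, inner_val in inner_cv.split(x_train, y_train, groups=groups_train):
        assert set(groups_train[inner_train]).isdisjoint(groups_train[inner_val])

print(mismatch_rejected)
```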

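User

#3 · Edited 2024-04-19 13:17:38

I have been trying to implement nested CV with GroupKFold and also tried to follow the sklearn example you mentioned. I ended up with the same error as you and found this thread.

I don't think ywbaek's answer solves the problem correctly.

After some searching, I found several issues on the sklearn GitHub that relate either to this specific problem or to other forms of the same problem. I think it comes down to the groups parameter not being propagated to all methods (I tried to trace where it fails in the script, but quickly got lost).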
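
The underlying requirement can be shown on its own: a search whose cv is a GroupKFold needs groups at fit time. The following is a small sketch on synthetic data (the data, group labels, and parameter grid are assumptions for illustration); it does not reproduce the linked issues themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
groups = np.repeat(np.arange(6), 10)  # hypothetical group labels

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [1, 10]},
    cv=GroupKFold(n_splits=3),
)

# Without groups, the inner GroupKFold has nothing to split on.
try:
    search.fit(X, y)
    failed_without_groups = False
except ValueError as err:
    failed_without_groups = True
    print("fit without groups failed:", err)

# With groups passed to fit, the search runs normally.
search.fit(X, y, groups=groups)
print(search.best_params_)
```

When such a search is wrapped in cross_val_score, nothing makes that inner fit call receive groups unless they are routed to it, which is consistent with the warning in the question.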

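Here are the issues:

As you can see, these go back some time (October 2016). I don't know much about development, but fixing this is clearly not a priority. I suppose that is fine, but the nested CV example specifically suggests using methods such as GroupKFold, which does not work, so the example should be updated.

If you still want to build a nested CV with GroupKFold, there are of course other ways. An example with logistic regression: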

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# X, y and group are assumed to be NumPy arrays with one row/entry per sample
pred_y = []
true_y = []

model = LogisticRegression()
Cs = [1, 10, 100]
p_grid = {'C': Cs}

inner_CV = GroupKFold(n_splits=4)
outer_CV = GroupKFold(n_splits=4)

for train_index, test_index in outer_CV.split(X, y, groups=group):
    X_tr, X_tt = X[train_index, :], X[test_index, :]
    y_tr, y_tt = y[train_index], y[test_index]

    # Inner search: groups must be sliced to the training rows
    clf = GridSearchCV(estimator=model, param_grid=p_grid, cv=inner_CV)
    clf.fit(X_tr, y_tr, groups=group[train_index])

    # Collect out-of-fold predictions and the matching true labels
    pred = clf.predict(X_tt)
    pred_y.extend(pred)
    true_y.extend(y_tt)

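You can then evaluate the predictions against the true labels in whatever way you like. Of course, if you are still interested in comparing nested and non-nested scores, you can also collect the non-nested scores, which I have not done here.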
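
One possible way to do that evaluation and collect a non-nested score for comparison is sketched below. The synthetic data, the group layout, and the use of accuracy as the metric are assumptions made for the example, not choices from the answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, GroupKFold

X, y = make_classification(n_samples=160, n_features=8, random_state=0)
group = np.repeat(np.arange(8), 20)  # hypothetical: 8 groups of 20 samples

p_grid = {"C": [1, 10, 100]}
inner_CV = GroupKFold(n_splits=3)
outer_CV = GroupKFold(n_splits=4)

pred_y, true_y, fold_scores = [], [], []
for train_index, test_index in outer_CV.split(X, y, groups=group):
    clf = GridSearchCV(LogisticRegression(max_iter=1000), p_grid, cv=inner_CV)
    clf.fit(X[train_index], y[train_index], groups=group[train_index])
    pred = clf.predict(X[test_index])
    fold_scores.append(accuracy_score(y[test_index], pred))
    pred_y.extend(pred)
    true_y.extend(y[test_index])

# Nested estimate: accuracy over all out-of-fold predictions.
nested_score = accuracy_score(true_y, pred_y)

# Non-nested estimate: best inner score when the search sees all the data.
non_nested = GridSearchCV(LogisticRegression(max_iter=1000), p_grid, cv=outer_CV)
non_nested.fit(X, y, groups=group)
non_nested_score = non_nested.best_score_

print(fold_scores, nested_score, non_nested_score)
```

The non-nested score picks hyperparameters and reports performance on the same splits, so it tends to be the more optimistic of the two.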
