Nested cross-validation in Python: a review of correct usage

Posted 2024-05-29 00:11:01


I am currently working on a binary classification problem with roughly 2,000 samples and am trying to implement nested cross-validation. I would like to know whether this implementation is correct. As an example I use the SVC algorithm.

First I run the nested CV to obtain a realistic estimate of the algorithm's performance. (That performance estimate is the selection criterion for building the final ensemble, which will end up using about 5 of roughly 9 algorithms.)

Then, on all the training data, I obtain the best hyperparameters with a regular cross-validated grid search and finally build an ensemble of the 3-5 best SVC parameter settings to reduce variance. In the end, the different algorithms (SVM, AdaBoost, LogReg, XGBoost) are combined into one ensemble (majority voting and/or stacking). The monetary score is a custom scoring function based on the confusion matrix (this is a fraud-detection problem).

Below you can see my code.
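The monetary scoring function itself is not shown in the question. A minimal sketch of such a confusion-matrix-based scorer could look like the following; the gain/cost values are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def monetary_gain(y_true, y_pred):
    # Assign a (hypothetical) monetary value to each confusion-matrix cell:
    # a caught fraud (TP) earns money, a false alarm (FP) costs a little,
    # a missed fraud (FN) costs a lot. Values are illustrative only.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    gain = 50 * tp - 10 * fp - 100 * fn
    # Normalize per sample so folds of different sizes are comparable
    return gain / len(y_true)

# Wrap it so it can be passed as scoring=... to GridSearchCV/cross_validate
monetary_score = make_scorer(monetary_gain, greater_is_better=True)
```

With `greater_is_better=True`, grid search will maximize the average per-sample gain, which matches how the scores below are compared.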

Nested CV

# Prepare nested CV: the inner loop tunes hyperparameters,
# the outer loop estimates generalization performance
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate
from sklearn.svm import SVC

cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

model = SVC(kernel="linear", random_state=17)
params = {"C": [0.08, 0.09, 0.1, 0.11, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}

# Grid search is the inner loop: it selects C on the inner folds
grid = GridSearchCV(estimator=model, param_grid=params, scoring=monetary_score, cv=cv_inner, n_jobs=-1)

# cross_validate is the outer loop: it scores each outer fold's tuned model
nested_cv = cross_validate(estimator=grid, X=X, y=y, scoring=monetary_score, cv=cv_outer, return_estimator=True, n_jobs=-1)
nested_score = nested_cv["test_score"].mean()

print(nested_score)
0.3361904761904762
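Because `return_estimator=True` was passed, each outer fold's fitted `GridSearchCV` can also be inspected afterwards; the spread of the selected `C` across folds is a useful stability check. A self-contained sketch on synthetic data (the real data and the custom scorer are replaced by `make_classification` and the default accuracy score here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate
from sklearn.svm import SVC

# Stand-in for the real dataset
X, y = make_classification(n_samples=200, random_state=17)

cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

grid = GridSearchCV(SVC(kernel="linear", random_state=17),
                    {"C": [0.1, 0.5, 0.9]}, cv=cv_inner, n_jobs=-1)
nested_cv = cross_validate(grid, X, y, cv=cv_outer,
                           return_estimator=True, n_jobs=-1)

# Each entry is a GridSearchCV refit on that outer fold's training data;
# if best_params_ varies wildly between folds, model selection is unstable
for fold, est in enumerate(nested_cv["estimator"]):
    print(fold, est.best_params_, round(nested_cv["test_score"][fold], 3))
```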

Regular grid search

grid.fit(X, y)
means = grid.cv_results_["mean_test_score"]
stds = grid.cv_results_["std_test_score"]
ranks = grid.cv_results_["rank_test_score"]
for rank, mean, p in zip(ranks, means, grid.cv_results_["params"]):
    print(rank, "\t", mean, "\t", p)
   
print(f"\nBest params:\t{grid.best_params_}")
print(f"Best score:\t{grid.best_score_}\n")

9    0.35428571428571426     {'C': 0.08}
9    0.35428571428571426     {'C': 0.09}
12   0.27904761904761904     {'C': 0.1}
11   0.31714285714285717     {'C': 0.11}
7    0.38619047619047614     {'C': 0.2}
6    0.39571428571428574     {'C': 0.3}
2    0.41238095238095235     {'C': 0.4}
2    0.41238095238095235     {'C': 0.5}
2    0.41238095238095235     {'C': 0.6}
2    0.41238095238095235     {'C': 0.7}
8    0.38142857142857145     {'C': 0.8}
1    0.4514285714285714      {'C': 0.9}

Best params:    {'C': 0.9}
Best score: 0.4514285714285714
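Rather than copying the top settings by hand, the ensemble members can be picked straight out of `cv_results_`. A self-contained sketch (synthetic data and default scoring stand in for the real dataset and monetary scorer), ranking settings by mean test score with `np.argsort`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in for the real dataset
X, y = make_classification(n_samples=200, random_state=17)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

grid = GridSearchCV(SVC(kernel="linear", random_state=17),
                    {"C": [0.08, 0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.9]},
                    cv=cv, n_jobs=-1)
grid.fit(X, y)

# Indices of the five best-scoring parameter settings, best first
top5 = np.argsort(grid.cv_results_["mean_test_score"])[::-1][:5]

# Rebuild one SVC per selected setting for the voting ensemble
estimators = [
    (f"svc{i + 1}", SVC(**grid.cv_results_["params"][idx],
                        kernel="linear", random_state=17))
    for i, idx in enumerate(top5)
]
```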

Building the ensemble from the 5 best parameter settings

estimators = [
    ("svc1", SVC(C=0.9, kernel="linear", random_state=17)),
    ("svc2", SVC(C=0.7, kernel="linear", random_state=17)),
    ("svc3", SVC(C=0.6, kernel="linear", random_state=17)),
    ("svc4", SVC(C=0.4, kernel="linear", random_state=17)),
    ("svc5", SVC(C=0.5, kernel="linear", random_state=17))
]

final_clf = VotingClassifier(estimators, voting="hard")
# How well does the ensemble perform under cross-validation?
scoring = {'monetary_score': monetary_score,
           'accuracy': 'accuracy',
           'f1': 'f1',
           'auc': make_scorer(roc_auc_score)
           }

scores = cross_validate(final_clf, X, y, cv=cv_inner, scoring=scoring, n_jobs=-1)
print(scores["test_monetary_score"].mean())
0.41238095238095235
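For the final step of combining different algorithms, scikit-learn's `StackingClassifier` can be used alongside (or instead of) `VotingClassifier`. A minimal sketch with the algorithms mentioned above, on synthetic stand-in data (XGBoost is omitted since it is a third-party package):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for the real dataset
X, y = make_classification(n_samples=200, random_state=17)

base_estimators = [
    ("svm", SVC(C=0.9, kernel="linear", random_state=17)),
    ("ada", AdaBoostClassifier(random_state=17)),
    ("logreg", LogisticRegression(max_iter=1000, random_state=17)),
]

# Stacking trains a meta-learner on the base models' out-of-fold predictions,
# instead of just counting votes as VotingClassifier does
stack = StackingClassifier(estimators=base_estimators,
                           final_estimator=LogisticRegression(),
                           cv=5, n_jobs=-1)
scores = cross_val_score(stack, X, y, cv=5)
print(scores.mean())
```

As with the SVC ensemble above, the stacked model should ultimately be scored with the monetary scorer, not plain accuracy.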

Tags: test, random, params, mean, kernel, cv, grid, nested
