I am currently working on a binary classification problem with roughly 2,000 samples and am trying to implement nested cross-validation. I would like to know whether this implementation is correct. As an example I will use the SVC algorithm. I first run the nested CV to obtain a realistic performance estimate of the algorithm. (That estimate is the selection criterion for building the final ensemble, which will eventually use about 5 of roughly 9 algorithms.)

Then, using all the training data, I obtain the best hyperparameters with a regular cross-validated grid search and finally form an ensemble of the 3-5 best SVC parameter settings to reduce variance. In the end, the different algorithms (SVM, AdaBoost, LogReg, XGBoost) will be combined into one ensemble (majority voting and/or stacking). The monetary score is a custom scoring function based on the confusion matrix (fraud detection). Below you can see my code.
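(The `monetary_score` used throughout the code is not shown in the question. For readers who want to reproduce the snippets, here is a minimal, hypothetical confusion-matrix-based scorer; the gain/cost values and the normalization are made-up placeholders, not the asker's actual function.)

```python
from sklearn.metrics import confusion_matrix, make_scorer

def monetary_gain(y_true, y_pred):
    # tn, fp, fn, tp follow sklearn's confusion_matrix ravel order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Illustrative fraud-detection economics (placeholder numbers):
    # a caught fraud gains 50, a false alarm costs 10, a missed fraud costs 100.
    gain = 50 * tp - 10 * fp - 100 * fn
    # Normalize per sample so folds of different sizes are comparable.
    return gain / len(y_true)

monetary_score = make_scorer(monetary_gain, greater_is_better=True)
```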
Nested CV
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate
from sklearn.svm import SVC

# Prepare nested CV
cv_outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
cv_inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
model = SVC(kernel="linear", random_state=17)
params = {"C": [0.08, 0.09, 0.1, 0.11, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}
# Grid search over the inner folds selects the hyperparameters
grid = GridSearchCV(estimator=model, param_grid=params, scoring=monetary_score, cv=cv_inner, n_jobs=-1)
# The outer folds give the (nearly) unbiased nested score
nested_cv = cross_validate(estimator=grid, X=X, y=y, scoring=monetary_score, cv=cv_outer, return_estimator=True, n_jobs=-1)
nested_score = nested_cv["test_score"].mean()
print(nested_score)
0.3361904761904762
Regular grid search
grid.fit(X, y)
means = grid.cv_results_["mean_test_score"]
stds = grid.cv_results_["std_test_score"]
ranks = grid.cv_results_["rank_test_score"]
# Use a loop variable that does not shadow the `params` grid defined above
for rank, mean, param in zip(ranks, means, grid.cv_results_["params"]):
    print(rank, "\t", mean, "\t", param)
print(f"\nBest params:\t{grid.best_params_}")
print(f"Best score:\t{grid.best_score_}\n")
9 0.35428571428571426 {'C': 0.08}
9 0.35428571428571426 {'C': 0.09}
12 0.27904761904761904 {'C': 0.1}
11 0.31714285714285717 {'C': 0.11}
7 0.38619047619047614 {'C': 0.2}
6 0.39571428571428574 {'C': 0.3}
2 0.41238095238095235 {'C': 0.4}
2 0.41238095238095235 {'C': 0.5}
2 0.41238095238095235 {'C': 0.6}
2 0.41238095238095235 {'C': 0.7}
8 0.38142857142857145 {'C': 0.8}
1 0.4514285714285714 {'C': 0.9}
Best params: {'C': 0.9}
Best score: 0.4514285714285714
Building an ensemble of the 5 best parameter settings
estimators = [
("svc1", SVC(C=0.9, kernel="linear", random_state=17)),
("svc2", SVC(C=0.7, kernel="linear", random_state=17)),
("svc3", SVC(C=0.6, kernel="linear", random_state=17)),
("svc4", SVC(C=0.4, kernel="linear", random_state=17)),
("svc5", SVC(C=0.5, kernel="linear", random_state=17))
]
final_clf = VotingClassifier(estimators, voting="hard")
# How well does the ensemble perform under cross-validation?
scoring = {
    "monetary_score": monetary_score,
    "accuracy": "accuracy",
    "f1": "f1",
    "auc": make_scorer(roc_auc_score),
}
scores = cross_validate(final_clf, X, y, cv=cv_outer, scoring=scoring, n_jobs=-1)
print(scores["test_monetary_score"].mean())
0.41238095238095235
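For the last step described at the top (combining the different algorithms into one ensemble), a rough sketch of the hard-voting variant might look like the following; the particular estimators, their hyperparameters, and the demo data are illustrative assumptions, not part of the question's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Placeholder data so the sketch runs stand-alone; substitute your own X, y.
X_demo, y_demo = make_classification(n_samples=200, random_state=17)

cross_algo_clf = VotingClassifier(
    estimators=[
        ("svc", SVC(C=0.9, kernel="linear", random_state=17)),
        ("logreg", LogisticRegression(max_iter=1000, random_state=17)),
        ("ada", AdaBoostClassifier(random_state=17)),
    ],
    voting="hard",  # soft voting would require probability=True on the SVC
)
cross_algo_clf.fit(X_demo, y_demo)
preds = cross_algo_clf.predict(X_demo)
```

The stacking alternative mentioned in the question would swap `VotingClassifier` for `StackingClassifier` with a meta-estimator; the estimator list stays the same.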