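Should I use the training set or the validation set for parameter optimization?
I am training a model with a decision tree and tuning its hyperparameters.
As I understand it, the purpose of the validation set is to evaluate the model's performance during training and to help tune the parameters.
In that case, should I pass the validation set, rather than my training set, to grid_search.fit?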
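from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hyperparameter values to try.
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')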
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")
# Validation
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")
# Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
1 Answer
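According to the scikit-learn documentation for GridSearchCV(), the data you pass to this function is automatically split into folds and cross-validated. So you only need to supply all of your data except the final test data; there is no need to carve out a validation set yourself.
To do that, you may want to merge the training and validation data: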
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Merge the training and validation datasets, for use in the GridSearchCV() function.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy') # This uses 5-fold cross-validation.
grid_search.fit(X_opt, y_opt) # Fit to the merged datasets.
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")
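With cv=5, each round of the search trains on roughly 80% of the data you pass in and validates on the remaining 20%, rotating through the folds so that every part serves as validation data once. With the modified code above, you make full use of both your training and validation data while never tuning against the test data.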
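To make the 80%/20% rotation concrete, here is a minimal sketch of the splitting that happens inside the search. It assumes a classifier with a binary or multiclass target, in which case an integer cv corresponds to StratifiedKFold; X_demo and y_demo are hypothetical toy arrays, not your data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 100 samples, two balanced classes.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# For a classifier, cv=5 in GridSearchCV corresponds to StratifiedKFold(n_splits=5).
splitter = StratifiedKFold(n_splits=5)

fold_sizes = []
validated_rows = []
for fold, (train_idx, val_idx) in enumerate(splitter.split(X_demo, y_demo), 1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated_rows.extend(val_idx)
    print(f"Fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Every row ends up in the validation part exactly once, which is why no separate validation set is needed for the search itself.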
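You are right that tuning should be done against validation data, while fitting must always be done on training data. That is essentially what GridSearchCV() does, using k-fold cross-validation.
Next, analyse the grid search results using the actual folds it worked with: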
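If you would rather keep your own fixed validation set instead of k-fold cross-validation, one possible approach is scikit-learn's PredefinedSplit. The sketch below is only an illustration under assumptions: the synthetic data and the names X_tr, X_va, y_tr, y_va and search are hypothetical stand-ins for your own split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for your own train/validation split.
X_all, y_all = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

# -1 marks rows that are only ever trained on; 0 marks the single validation fold.
test_fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
split = PredefinedSplit(test_fold)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, 7, 10]},
    cv=split,
    scoring="accuracy",
)
# Pass the stacked data; the splitter decides which rows play which role.
search.fit(np.vstack((X_tr, X_va)), np.hstack((y_tr, y_va)))
print("Best parameters:", search.best_params_)
print("Number of splits:", search.n_splits_)
```

Each candidate is fitted on the training rows and scored on the validation rows only. With the default refit=True, the best candidate is then refitted on all rows passed to fit, so best_estimator_ has seen both parts. A single split gives a noisier estimate than 5-fold cross-validation, which is a reason to prefer the merged approach when data is limited.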
# Analyse grid search.
best_clf = grid_search.best_estimator_ # Refit on all of X_opt with the best parameters (refit=True by default).
results = grid_search.cv_results_ # Access fold-specific results.
num_folds = grid_search.n_splits_ # Number of folds actually used (works even when cv is not an int).
# Initialise a list to hold best fold-specific scores.
best_scores_per_fold = [float("-inf")] * num_folds
# Iterate over each fold.
for i in range(num_folds):
    fold_key = f"split{i}_test_score"
    # Find the best score for this fold, across all parameter combinations.
    best_score_for_fold = np.max(results[fold_key])
    best_scores_per_fold[i] = best_score_for_fold
# Print the best scores per fold.
for i, score in enumerate(best_scores_per_fold, 1):
    print(f"Best score for fold {i}: {score}")
# Test your model on the testing data.
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
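One caveat about the per-fold loop in the analysis code: it takes the maximum over all parameter combinations within each fold, so the printed values may come from different combinations. If you want the per-fold scores of the combination that was actually selected, a sketch along these lines may help. It relies on the best_index_ and n_splits_ attributes of a fitted GridSearchCV; the synthetic data and the reduced grid are assumptions made so the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data so the snippet is self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best = search.best_index_  # Row of cv_results_ for the selected combination.
chosen_fold_scores = np.array(
    [search.cv_results_[f"split{i}_test_score"][best] for i in range(search.n_splits_)]
)
print("Per-fold scores of the selected parameters:", chosen_fold_scores)
print("Their mean:", chosen_fold_scores.mean())
print("best_score_:", search.best_score_)
```

The mean of these per-fold scores equals best_score_, which is the cross-validated estimate to compare against the test-set accuracy.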