Should I use the training set or the validation set for parameter optimization?

Asked 2025-04-13 00:22

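I am training a decision tree model and tuning its hyperparameters.

As I understand it, the purpose of the validation set is to evaluate the model during training and to help adjust its parameters.

In that case, shouldn't I pass the validation set to grid_search.fit instead of my training set?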

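from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {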
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")

# Validation: score the refit best model on the held-out validation set.
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")

# Test: final evaluation on data not used for fitting or tuning.
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")

Tags: decision-tree, model-evaluation, validation-set, parameter-optimization

1 Answer

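According to the scikit-learn documentation for GridSearchCV(), the data you pass to its fit method is automatically split into folds and cross-validated. So you only need to pass in all of your data except the final test data; there is no need to carve out a validation set yourself.

To do that, you may want to merge your training and validation data: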

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Merge the training and validation datasets, for use in the GridSearchCV() function.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')  # This uses 5-fold cross-validation.
grid_search.fit(X_opt, y_opt)  # Fit to the merged datasets.
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")

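With cv=5, each candidate parameter combination is trained on about 80% of the merged data and scored on the remaining 20%, and the folds rotate so that every part serves as the validation data exactly once. With the modified code above, you make full use of your training and validation data while never tuning on the test data.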
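
The 80/20 rotation can be made visible with a small sketch. It assumes synthetic stand-in data (100 balanced samples, not your dataset) and relies on the documented behaviour that, for a classifier and an integer cv, GridSearchCV splits with StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption): 100 samples, two balanced classes.
X_opt = np.arange(100).reshape(-1, 1)
y_opt = np.array([0, 1] * 50)

# Same kind of splitter GridSearchCV uses for cv=5 with a classifier.
cv = StratifiedKFold(n_splits=5)

fold_sizes = []
validated = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_opt, y_opt), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx)
    print(f"Fold {fold}: train on {len(train_idx)} samples, validate on {len(val_idx)}")
```

Each fold trains on 80 samples and validates on the other 20, and every sample ends up in a validation part exactly once.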

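You are right that tuning should be judged on validation data, while fitting must always happen on training data. That is essentially what GridSearchCV() does, by means of k-fold cross-validation.

Next, you can analyse the grid search results using the folds it actually evaluated: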
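
If you would rather tune against your existing validation set than against rotating folds, one option that is not part of the answer above is scikit-learn's PredefinedSplit. The sketch below is a rough illustration on synthetic stand-in data; the dataset and the reduced parameter grid are assumptions for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an existing train/validation split (an assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# -1 marks rows that are always used for training; 0 marks the single validation fold.
test_fold = np.concatenate(
    [np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)]
)
split = PredefinedSplit(test_fold)

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, 7, 10]},
    cv=split,
    scoring="accuracy",
)
# Pass both parts together; the split decides which rows train and which validate.
grid_search.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))

print("Best parameters:", grid_search.best_params_)
print("Validation accuracy:", grid_search.best_score_)
```

With the default refit=True, the best estimator is refit on the training and validation rows together once the parameters have been chosen. A single fixed validation split gives a noisier estimate than 5-fold cross-validation, so this is mainly useful when the split itself matters, for example with time-ordered data.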

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Analyse grid search.
best_clf = grid_search.best_estimator_  # Refit on all of X_opt with the best parameters.
results = grid_search.cv_results_  # Access fold-specific results.
num_folds = grid_search.n_splits_  # Number of folds actually used, whatever cv was set to.

# For each fold, take the highest score over all parameter combinations.
# Note: these maxima can come from different combinations in different folds.
best_scores_per_fold = [
    np.max(results[f"split{i}_test_score"]) for i in range(num_folds)
]

# Print the best scores per fold.
for i, score in enumerate(best_scores_per_fold, 1):
    print(f"Best score for fold {i}: {score}")

# Test your model, on the testing data.
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
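
The per-fold maximum above can mix scores from different parameter combinations. To see how the selected combination alone did on each fold, a possible variant indexes cv_results_ with best_index_. This is a sketch on synthetic stand-in data with a reduced grid, both of which are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the merged training and validation data (an assumption).
X_opt, y_opt = make_classification(n_samples=300, n_features=10, random_state=42)

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, 7, 10], "min_samples_leaf": [1, 2, 4]},
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_opt, y_opt)

results = grid_search.cv_results_
best = grid_search.best_index_  # Row of cv_results_ for the selected parameters.

# Validation score of the selected combination on each fold.
selected_fold_scores = np.array(
    [results[f"split{i}_test_score"][best] for i in range(grid_search.n_splits_)]
)
print("Selected parameters:", results["params"][best])
print("Per-fold accuracy:", selected_fold_scores)
print("Mean:", selected_fold_scores.mean(), "Std:", selected_fold_scores.std())
```

The mean of these scores is what GridSearchCV reports as best_score_, and the spread across folds gives a rough sense of how stable the chosen parameters are.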

Answered 2025-04-13 by Python大师
