Unable to run grid search and train the model

Asked 2025-04-14 17:26

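I'm working on a basic text classification problem. I want to use a stacking classifier and tune the hyperparameters of my base classifiers to get better accuracy.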

My dataset has 8,000 rows and 2 columns (text and category). The code below seems to hang, and since I'm a beginner in this area I can't figure out what is wrong.

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import NuSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, log_loss, classification_report, confusion_matrix

# Define parameter grids for classifiers
param_grid_nusvc = {
    'nu': [0.1, 0.3, 0.5, 0.7, 0.9],
    'kernel': ['linear', 'rbf'],
}

param_grid_logreg = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2'],
}

# Grid-search each classifier separately (2-fold CV, scored by accuracy).
# X_train and y_train are assumed to be the vectorized text features and labels
# from an earlier train_test_split that is not shown here.
nusvc_grid_search = GridSearchCV(NuSVC(probability=True), param_grid_nusvc, cv=2, scoring='accuracy')
# The default lbfgs solver does not support the 'l1' penalty; saga supports both 'l1' and 'l2'.
logreg_grid_search = GridSearchCV(LogisticRegression(solver='saga'), param_grid_logreg, cv=2, scoring='accuracy')

nusvc_grid_search.fit(X_train, y_train)
logreg_grid_search.fit(X_train, y_train)

# Get best parameters
best_params_nusvc = nusvc_grid_search.best_params_
best_params_logreg = logreg_grid_search.best_params_

# Set up base classifiers with best parameters (same solver as in the search)
best_nusvc = NuSVC(probability=True, **best_params_nusvc)
best_logreg = LogisticRegression(solver='saga', **best_params_logreg)

# Setting up stacking classifier
sc = StackingClassifier(
    estimators=[
        ('NuSVC', best_nusvc),
        ('LDA', LinearDiscriminantAnalysis())
    ],
    final_estimator=best_logreg
)

sc.fit(X_train, y_train)

# Evaluate the stacked classifier on the held-out test set
print('****Results****')
test_predictions = sc.predict(X_test)
acc = accuracy_score(y_test, test_predictions)
print(f"Accuracy: {acc:.4%}")

test_predictions_proba = sc.predict_proba(X_test)
ll = log_loss(y_test, test_predictions_proba)
print(f"Log Loss: {ll}")

# Print classification report (optional)
print('\nClassification Report:')
print(classification_report(y_test, test_predictions))

# Print confusion matrix (optional)
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, test_predictions))

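I modified the code above following a suggestion from ChatGPT, because I wanted to know how to tune the parameters with a grid search. Now the code seems to be stuck (about 20 minutes so far). Without the grid search, it finishes in about 2-3 minutes.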

1 Answer


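Your SVC grid has 5×2 = 10 points, and each point is fit on 2 folds, so that is 20 fits and the search will take roughly 20 times as long as a single fit. You can set verbose=4 on the search to better track what is happening. Also consider running the fits in parallel (for example, by setting n_jobs=-1).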
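As a rough illustration of the fit count and of the two suggestions above, here is a minimal sketch. It assumes a small synthetic dataset from make_classification as a stand-in for the asker's vectorized text data, and it leaves out probability=True to keep the run short; neither choice comes from the question. ParameterGrid is used only to count the candidates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import NuSVC

# Synthetic, perfectly balanced stand-in for the real features (an assumption).
X, y = make_classification(n_samples=200, n_features=20, flip_y=0, random_state=0)

# Same NuSVC grid as in the question.
param_grid_nusvc = {
    'nu': [0.1, 0.3, 0.5, 0.7, 0.9],
    'kernel': ['linear', 'rbf'],
}
cv = 2

# Count the work before starting: candidates x folds, plus one final refit.
n_candidates = len(ParameterGrid(param_grid_nusvc))
n_fits = n_candidates * cv
print(f"{n_candidates} candidates x {cv} folds = {n_fits} fits (plus 1 refit)")

search = GridSearchCV(
    NuSVC(),            # probability=True omitted here only to keep the sketch fast
    param_grid_nusvc,
    cv=cv,
    scoring='accuracy',
    verbose=4,          # report progress for each fit
    n_jobs=-1,          # run fits in parallel on all available cores
)
search.fit(X, y)
print(search.best_params_)
```

With the real data, the same verbose and n_jobs arguments can be passed to the existing GridSearchCV calls; the progress output should show whether the search is still advancing or truly stuck.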
