Nested cross-validation for grid search with a precomputed kernel in scikit-learn

Asked 2025-04-18 12:16

I have a precomputed kernel matrix of size NxN. I am using GridSearchCV to tune the C parameter of an SVM with kernel='precomputed', as follows:

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

C_range = 10. ** np.arange(-2, 9)
param_grid = dict(C=C_range)
grid = GridSearchCV(SVC(kernel='precomputed'), param_grid=param_grid, cv=StratifiedKFold(n_splits=10))
grid.fit(kernel, data_label)
print(grid.best_score_)

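This works reasonably well, but because I predict on the full data set (via grid.predict(kernel)), the model is evaluated on the same samples it was trained on, so the results are overfitted (most of the time I get precision and recall of 1.0).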

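So I would like to first split my data into 10 folds (9 for training, 1 for testing) using cross-validation. In each round, I want to run GridSearch on the training set to tune C and then test on the held-out fold.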

To do this, I split the kernel matrix into 100x100 and 50x50 sub-matrices, ran grid.fit() on one of them and grid.predict() on the other.

But I get the following error:

ValueError: X.shape[1] = 50 should be equal to 100, the number of features at training time

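I guess the training kernel matrix has to have the same dimensions as the test kernel matrix, but I don't understand why: I simply compute np.dot(X, X.T) for each subset, so the resulting 100x100 and 50x50 kernel matrices have different dimensions...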

overfitting scikit-learn model-evaluation cross-validation svm grid-search kernel-matrix precomputed-kernel

2 Answers

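Writing a custom grid search is actually fairly simple, although, as far as I know, sklearn still has no built-in way to do this. Here is a simple snippet that tunes the C parameter: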

import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC

def precomputed_kernel_GridSearchCV(K, y, Cs, n_splits=5, test_size=0.2, random_state=42):
    """A version of grid search CV,
    but adapted for SVM with a precomputed kernel
    K (np.ndarray) : precomputed kernel, shape (n, n)
    y (np.ndarray) : labels, length n
    Cs (iterable) : values of C to try
    return: optimal value of C
    """
    n = K.shape[0]
    assert len(K.shape) == 2
    assert K.shape[1] == n
    assert len(y) == n

    best_score = float('-inf')
    best_C = None

    indices = np.arange(n)

    for C in Cs:
        # For each value of C, evaluate on repeated random train/test splits
        # (ShuffleSplit, not strict k-fold). The score for C is the mean
        # accuracy over these splits.
        scores = []
        ss = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=random_state)
        for train_index, test_index in ss.split(indices):
            # Training kernel: rows and columns are both training samples
            K_train = K[np.ix_(train_index, train_index)]
            # Test kernel: rows are test samples, columns are training samples
            K_test = K[np.ix_(test_index, train_index)]
            y_train = y[train_index]
            y_test = y[test_index]
            svc = SVC(kernel='precomputed', C=C)
            svc.fit(K_train, y_train)
            scores.append(svc.score(K_test, y_test))
        mean_score = np.mean(scores)
        if mean_score > best_score:
            best_score = mean_score
            best_C = C
    return best_C
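
To get the nested scheme described in the question, a selection routine like the one above could be called inside an outer loop, so that C is tuned only on the outer training fold and scored on the held-out fold. The following is only a rough sketch under stated assumptions: the data are synthetic, the kernel is linear, and select_C is a hypothetical compact stand-in for the function above, inlined so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedKFold
from sklearn.svm import SVC

def select_C(K, y, Cs, n_splits=5, test_size=0.2, random_state=42):
    """Hypothetical compact stand-in: pick C by mean accuracy over random splits."""
    ss = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=random_state)
    best_C, best_score = None, float('-inf')
    for C in Cs:
        scores = []
        for tr, te in ss.split(np.arange(K.shape[0])):
            svc = SVC(kernel='precomputed', C=C).fit(K[np.ix_(tr, tr)], y[tr])
            scores.append(svc.score(K[np.ix_(te, tr)], y[te]))
        if np.mean(scores) > best_score:
            best_C, best_score = C, np.mean(scores)
    return best_C

# Synthetic data (assumption): 150 samples, 5 features, linear kernel
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] > 0).astype(int)
K = X @ X.T  # full N x N Gram matrix

Cs = [0.01, 0.1, 1.0, 10.0]
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []
chosen_Cs = []
for train_idx, test_idx in outer.split(np.zeros(len(y)), y):
    # Inner search sees only the outer training block of the kernel
    K_train = K[np.ix_(train_idx, train_idx)]
    best_C = select_C(K_train, y[train_idx], Cs)
    chosen_Cs.append(best_C)
    # Refit on the whole outer training fold, then score on the held-out fold
    svc = SVC(kernel='precomputed', C=best_C).fit(K_train, y[train_idx])
    K_test = K[np.ix_(test_idx, train_idx)]  # rows: test, columns: train
    outer_scores.append(svc.score(K_test, y[test_idx]))

print(np.mean(outer_scores))
```

The mean of outer_scores is then an estimate that never uses the test fold for choosing C, which is the point of nesting.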

Answered 2025-04-18 by Python大师

The scikit-learn documentation says:

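Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel values between all training vectors and the test vectors must be provided.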

So I guess that (simple) cross-validation with a precomputed kernel is not possible.
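
As a rough illustration of what the quoted requirement means for array shapes: at prediction time the matrix passed in has one row per test sample and one column per training sample, which would explain the ValueError in the question when a 50x50 block is used. The sketch below assumes a linear kernel and synthetic data; all array names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (assumption): 150 samples, 5 features, two classes
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] > 0).astype(int)

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# Training kernel: training samples against training samples
K_train = X_train @ X_train.T  # shape (100, 100)
# Test kernel: test samples against the SAME training samples
K_test = X_test @ X_train.T    # shape (50, 100), not (50, 50)

svc = SVC(kernel='precomputed', C=1.0)
svc.fit(K_train, y_train)
y_pred = svc.predict(K_test)

print(K_train.shape, K_test.shape, y_pred.shape)
```

Computing np.dot(X_test, X_test.T) instead would give a 50x50 matrix whose columns no longer correspond to the training samples the model was fitted on.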

Answered 2025-04-18 by Python大师
