SkLearn Decision Tree: Overfitting or a Bug?

0 votes

1 answer

1,751 views

Asked 2025-04-18 08:00

I'm using sklearn's tree package to analyze the training error and validation error of my decision tree model.

import numpy as np

# Compute the misclassification rate (fraction of wrong predictions)
def compute_error(x, y, model):
    yfit = model.predict(x.toarray())
    return np.mean(y != yfit)

def drawLearningCurve(model, xTrain, yTrain, xTest, yTest):
    sizes = np.linspace(2, 25000, 50).astype(int)
    train_error = np.zeros(sizes.shape)
    crossval_error = np.zeros(sizes.shape)

    for i, size in enumerate(sizes):
        # Fit on the first `size` training samples
        model = model.fit(xTrain[:size, :].toarray(), yTrain[:size])

        # Compute the validation error
        crossval_error[i] = compute_error(xTest, yTest, model)

        # Compute the training error
        train_error[i] = compute_error(xTrain[:size, :], yTrain[:size], model)

from sklearn import tree
clf = tree.DecisionTreeClassifier()
drawLearningCurve(clf, xtr, ytr, xte, yte)

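The problem (and I'm not sure whether it is one) is that when I pass a decision tree as the model to drawLearningCurve, I get a training error of 0.0 in every iteration. Is this due to the nature of my dataset, or to sklearn's tree package? Or is something else wrong?

P.S. With other models, such as naive Bayes, KNN, or an artificial neural network, the training error is never 0.0.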

decision-tree  sklearn  overfitting  model-evaluation  training-error  validation-error

1 Answer

The suggestions so far point in some very useful directions. I'd add that the parameter you probably want to tune is max_depth.
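As a minimal sketch of what max_depth changes (the synthetic dataset from make_classification, the helper error_rate, and the depth of 3 are assumptions chosen for illustration, not your data or settings): with default settings a tree keeps splitting until every leaf is pure, so a training error of 0.0 is the expected outcome, while a depth-limited tree usually cannot memorize the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the real xtr/ytr/xte/yte).
# flip_y adds label noise, so a perfect training fit means memorizing noise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


def error_rate(model, X, y):
    """Fraction of samples the fitted model misclassifies."""
    return float(np.mean(model.predict(X) != y))


# Default settings: no depth limit, the tree grows until all leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Same data, but the tree may be at most 3 levels deep.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X_train, y_train)

train_err_full = error_rate(full_tree, X_train, y_train)
train_err_shallow = error_rate(shallow_tree, X_train, y_train)

print("full tree:    depth", full_tree.get_depth(),
      "train error", train_err_full,
      "test error", error_rate(full_tree, X_test, y_test))
print("shallow tree: depth", shallow_tree.get_depth(),
      "train error", train_err_shallow,
      "test error", error_rate(shallow_tree, X_test, y_test))
```

Whether the shallow tree also does better on held-out data depends on the dataset, so the depth is best chosen by comparing validation error across several values.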

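What worries me more is that your compute_error function is a bit odd. An error of 0 says that your classifier made no mistakes on the training set. But if it did make mistakes, the function would not necessarily tell you how many. When both sides of != are plain Python lists, the lists are compared as a whole and the result is a single True or False, so the mean can only be 0.0 or 1.0. (With NumPy arrays, which is what model.predict returns, the comparison is elementwise.)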

import numpy as np
np.mean([0,0,0,0] != [0,0,0,0]) # perfect match, error is 0
0.0

np.mean([0,0,0,0] != [1, 1, 1, 1]) # 100% wrong answers
1.0

np.mean([0,0,0,0] != [1, 1, 1, 0]) # 75% wrong answers
1.0

np.mean([0,0,0,0] != [1, 1, 0, 0]) # 50% wrong answers
1.0

np.mean([0,0,0,0] != [1, 1, 2, 2]) # 100% wrong answers
1.0

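What you want is an elementwise comparison on arrays, such as np.sum(np.asarray(y) != np.asarray(yfit)) for the number of errors, or better, one of the metric functions that sklearn provides, such as accuracy_score (which returns the fraction of correct predictions, so the error rate is 1 minus that value).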
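To make the difference concrete, here is a small sketch (the helper name error_rate and the toy labels are made up for illustration). It converts both inputs to NumPy arrays so that the comparison is elementwise, and cross-checks the result against sklearn's accuracy_score, which reports the fraction of correct predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def error_rate(y_true, y_pred):
    """Fraction of mismatched labels; accepts lists or arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))


y_true = [0, 0, 0, 0]
y_pred = [1, 1, 1, 0]  # three of the four predictions are wrong

# list != list compares the lists as a whole and yields a single bool
list_compare = float(np.mean(y_true != y_pred))
# elementwise comparison after converting to arrays
elementwise = error_rate(y_true, y_pred)
# accuracy_score is the fraction correct, so the error is its complement
via_sklearn = float(1.0 - accuracy_score(y_true, y_pred))

print(list_compare)  # 1.0
print(elementwise)   # 0.75
print(via_sklearn)   # 0.75
```

Going through np.asarray keeps the helper safe no matter whether the caller passes lists, arrays, or a mix of both.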

Answered 2025-04-18 by Python大师
