<p>This question bugged me as well, and while the other answers made good points, they didn't address every aspect of OP's question.</p>
<p>The real answer is: the growing discrepancy in scores for increasing k is caused by the chosen metric, R2 (coefficient of determination). For metrics such as MSE, MSLE, or MAE there is no difference between using <code>cross_val_score</code> and <code>cross_val_predict</code>.</p>
<p>See the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination" rel="nofollow noreferrer">definition of R2</a>:</p>
<p><em>R^2 = 1 - MSE(groundtruth, prediction) / MSE(groundtruth, <strong>mean(groundtruth)</strong>)</em></p>
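<p>This identity is easy to check numerically; a quick sketch (variable names are illustrative, random data stands in for real predictions):</p>
<pre><code>import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
gt = rng.random(100) * 20     # ground truth
pred = rng.random(100) * 100  # predictions

# MSE against the constant mean(gt) is just the (biased) variance of gt,
# so this ratio reproduces r2_score exactly.
manual_r2 = 1 - mean_squared_error(gt, pred) / mean_squared_error(gt, np.full_like(gt, gt.mean()))
assert np.isclose(manual_r2, r2_score(gt, pred))
</code></pre>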
<p>The bold part explains why the scores start to diverge for increasing k: the more splits we have, the fewer samples end up in each test fold, and the higher the variance in the mean of the test fold.
Conversely, for small k the mean of a test fold doesn't differ much from the mean of the full ground truth, since the sample size is still large enough to keep the variance small.</p>
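<p>To see this variance effect in isolation, here is a small sketch (random data on the same scale as the proof below) showing that the spread of the per-fold means grows with k:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
groundtruth = rng.random(1000) * 20

# Smaller folds -> noisier per-fold means -> noisier
# MSE(groundtruth, mean(groundtruth)) denominator in R2.
for k in (2, 10, 100, 250):
    fold_means = [fold.mean() for fold in np.split(groundtruth, k)]
    print(f'k={k:3d}: std of fold means = {np.std(fold_means):.3f}')
</code></pre>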
<p>Proof:</p>
<pre><code>import numpy as np
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_log_error as msle, r2_score
predictions = np.random.rand(1000)*100
groundtruth = np.random.rand(1000)*20
def scores_for_increasing_k(score_func):
    skewed_score = score_func(groundtruth, predictions)
    print(f'skewed score (from cross_val_predict): {skewed_score}')
    for k in (2,4,5,10,20,50,100,200,250):
        fold_preds = np.split(predictions, k)
        fold_gtruth = np.split(groundtruth, k)
        correct_score = np.mean([score_func(g, p) for g,p in zip(fold_gtruth, fold_preds)])
        print(f'correct CV for k={k}: {correct_score}')

for name, score in [('MAE', mae), ('MSLE', msle), ('R2', r2_score)]:
    print(name)
    scores_for_increasing_k(score)
    print()
</code></pre>
<p>The output is:</p>
<pre><code>MAE
skewed score (from cross_val_predict): 42.25333901481263
correct CV for k=2: 42.25333901481264
correct CV for k=4: 42.25333901481264
correct CV for k=5: 42.25333901481264
correct CV for k=10: 42.25333901481264
correct CV for k=20: 42.25333901481264
correct CV for k=50: 42.25333901481264
correct CV for k=100: 42.25333901481264
correct CV for k=200: 42.25333901481264
correct CV for k=250: 42.25333901481264
MSLE
skewed score (from cross_val_predict): 3.5252449697327175
correct CV for k=2: 3.525244969732718
correct CV for k=4: 3.525244969732718
correct CV for k=5: 3.525244969732718
correct CV for k=10: 3.525244969732718
correct CV for k=20: 3.525244969732718
correct CV for k=50: 3.5252449697327175
correct CV for k=100: 3.5252449697327175
correct CV for k=200: 3.5252449697327175
correct CV for k=250: 3.5252449697327175
R2
skewed score (from cross_val_predict): -74.5910282783694
correct CV for k=2: -74.63582817089443
correct CV for k=4: -74.73848598638291
correct CV for k=5: -75.06145142821893
correct CV for k=10: -75.38967601572112
correct CV for k=20: -77.20560102267272
correct CV for k=50: -81.28604960074824
correct CV for k=100: -95.1061197684949
correct CV for k=200: -144.90258384605787
correct CV for k=250: -210.13375041871123
</code></pre>
<p>Of course, there is another effect, not shown here, which others have mentioned:
with increasing k, more models are trained on more samples and validated on fewer samples, which affects the final scores, but this is not induced by the choice between <code>cross_val_score</code> and <code>cross_val_predict</code>.</p>