Which R-squared score is more helpful?

Posted 2024-06-16 12:11:04


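    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # `data` is a pandas DataFrame loaded earlier.
    # Drop the title column, then separate the predictors from the target.
    data.drop('Movie Title', axis=1, inplace=True)
    features = data.loc[:, data.columns != 'worldwide_gross_usd']
    charges = data['worldwide_gross_usd']

    # Hold out 20% of the rows for testing.
    X_train, X_test, y_train, y_test = train_test_split(features,
                                                        charges,
                                                        random_state=42,
                                                        test_size=0.2)

    regr = LinearRegression().fit(X_train, y_train)

    y_pred = regr.predict(X_test)

    print('Trained R-squared score: ', regr.score(X_train, y_train))
    print('Tested R-squared score: ', regr.score(X_test, y_test))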

Output:

Trained R-squared score: 0.5404764241697003

Tested R-squared score: 0.5845801856343114

    # Same data and model; only random_state changes (42 -> 12).
    X_train, X_test, y_train, y_test = train_test_split(features,
                                                        charges,
                                                        random_state=12,
                                                        test_size=0.2)

    regr = LinearRegression().fit(X_train, y_train)

    y_pred = regr.predict(X_test)

    print('Trained R-squared score: ', regr.score(X_train, y_train))
    print('Tested R-squared score: ', regr.score(X_test, y_test))

Output:

Trained R-squared score: 0.5345435646372121

Tested R-squared score: 0.602138324770633

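As you can see, when I change the random_state value, my training score drops by about 1% but my test score rises by about 2%. Which of the two R-squared scores would you prefer, the first or the second?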


1 answer
User
#1 · Posted 2024-06-16 12:11:04

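The R-squared score is a quick estimate of how well a regression model fits, but it is not a very reliable one.

Think of it like this:

You have 3 points on a 2D plane (say p1, p2, p3).

In the first case, you fit a regression line using p1 and p2, then test it on p3 and get a score r1.

Next, you fit a regression line using p2 and p3, then test it on p1 and get a score r2.

The data is the same in both cases; only the split changed, yet r1 and r2 will generally differ.

So you cannot rely solely on the R-squared score obtained from one particular random state.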
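The effect described above can be sketched with repeated random splits. This is a minimal illustration under stated assumptions, not the asker's setup: the data is synthetic (one feature plus Gaussian noise), the model is a hand-written one-variable least-squares fit, and `fit_line`, `r2_score`, and `split_scores` are hypothetical helper names. It uses only the standard library so it runs anywhere.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r2_score(xs, ys, a, b):
    """R-squared = 1 - SS_res / SS_tot, evaluated on the given points."""
    my = statistics.fmean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def split_scores(xs, ys, seed, test_size=0.2):
    """Shuffle with `seed`, hold out `test_size`, return (train R2, test R2)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_size)
    test, train = idx[:n_test], idx[n_test:]
    x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
    x_te, y_te = [xs[i] for i in test], [ys[i] for i in test]
    a, b = fit_line(x_tr, y_tr)
    return r2_score(x_tr, y_tr, a, b), r2_score(x_te, y_te, a, b)

# Synthetic data (an assumption for illustration only).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 6) for x in xs]

# Same data, same model, 20 different splits.
test_scores = [split_scores(xs, ys, seed)[1] for seed in range(20)]
print("test R2 min :", min(test_scores))
print("test R2 max :", max(test_scores))
print("test R2 mean:", statistics.fmean(test_scores))
print("test R2 std :", statistics.stdev(test_scores))
```

The spread between the minimum and maximum is the split-to-split noise; a difference between two seeds that is smaller than this spread says little about which split is "better". With scikit-learn, `sklearn.model_selection.cross_val_score` with `scoring='r2'` gives a similar picture by averaging over folds.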

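Inferences:

  1. If all data points are equally relevant, then the higher the R-squared score on the test set, the better.

  2. If you are unsure about the relevance of your dataset, you should check other parameters/methods to decide which R-squared score is better.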

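Other parameters/methods:

You should draw a residual plot for both cases. Check which one has a mean close to zero and a variance close to 1 (for most datasets); that one is better. If the residual plot of either case shows some kind of pattern, that case is not good and can be improved. If the residual plot of either case contains outliers, that case is also not good and can be improved.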
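The residual check can also be done numerically before plotting anything. The sketch below is an assumption-laden illustration, not the answerer's code: it uses synthetic one-feature data, hypothetical helpers `fit_line`, `standardized_residuals`, and `residual_summary`, and it reads "variance close to 1" as applying to residuals divided by the training residual standard deviation. The cutoff of 3 for flagging outliers is a common rule of thumb, not something the answer specifies.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def standardized_residuals(train, test):
    """Fit on train; return test residuals scaled by the training residual std."""
    a, b = fit_line(*train)
    train_res = [y - (a + b * x) for x, y in zip(*train)]
    scale = statistics.pstdev(train_res)
    return [(y - (a + b * x)) / scale for x, y in zip(*test)]

# Synthetic data (an assumption for illustration only).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 6) for x in xs]

def residual_summary(seed, n_test=40):
    """Split with `seed` and summarize the held-out standardized residuals."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train = ([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    test = ([xs[i] for i in test_idx], [ys[i] for i in test_idx])
    z = standardized_residuals(train, test)
    return {
        "mean": statistics.fmean(z),             # ideally close to 0
        "variance": statistics.pvariance(z),     # ideally close to 1
        "outliers": sum(abs(v) > 3 for v in z),  # ideally 0
    }

# Compare the two splits from the question (seeds 42 and 12).
for seed in (42, 12):
    print(seed, residual_summary(seed))
```

These numbers do not reveal patterns such as curvature or a funnel shape, so a scatter plot of residuals against predicted values is still worth drawing for each case.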

Note: Suppose you want to predict house prices and your data includes the area of the house, its location, BHK (number of bedrooms, hall, kitchen), the number of people who previously lived there, and so on. House prices depend far more on the area of the house than on the number of previous occupants, so the two are not equally relevant. This is what I mean by "equally relevant".
