Variance inflation factor for ridge regression in Python

Question

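I am running a ridge regression on data that is somewhat correlated. To find a stable fit, I used the ridge trace method, and thanks to the great example on scikit-learn I got that working. Another approach is to compute the variance inflation factor (VIF) for each variable and watch how it changes as k increases. Once the VIFs drop below 5, the fit is considered satisfactory. Statsmodels provides code to compute the VIF, but it is written for ordinary least squares (OLS) regression. I have tried to modify that code so that it handles ridge regression.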
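
For reference, the OLS VIF mentioned above can be sketched as follows. This is a minimal illustration, not the statsmodels implementation: the helper name `ols_vif` and the synthetic data are assumptions made for the example. It regresses each predictor on the others (with an intercept), computes 1 / (1 - R²), and checks the result against the diagonal of the inverse correlation matrix, which should give the same numbers.

```python
import numpy as np

def ols_vif(X):
    """VIF of each column of X from auxiliary OLS regressions (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so that R^2 is measured around the column mean.
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic, deliberately correlated predictors (illustrative data only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=200)
x3 = rng.normal(size=200)
X_demo = np.column_stack([x1, x2, x3])

vif_regression = ols_vif(X_demo)
# Equivalent shortcut: diagonal of the inverse correlation matrix.
vif_corr = np.diag(np.linalg.inv(np.corrcoef(X_demo, rowvar=False)))
print(vif_regression)
print(vif_corr)
```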

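I am checking my results against Chapter 10 of Regression Analysis by Example, 5th edition. My code produces the correct results for k = 0.000, but not for larger values of k. SAS code is available, but I am not a SAS user and do not know how it differs from scikit-learn (and/or statsmodels).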

I have been stuck on this for a few days, so any help would be greatly appreciated.

#http://www.ats.ucla.edu/stat/sas/examples/chp/chp_ch10.htm

from __future__ import division
import numpy as np
import pandas as pd
example = pd.read_csv('by_example_import.csv')
example.dropna(inplace=True)

from sklearn import preprocessing
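scaler = preprocessing.StandardScaler().fit(example)
# NOTE: transform() returns a new array; the result is not assigned here,
# so `example` (and X, y below) remain unscaled.
scaler.transform(example)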

X = example.drop(['year', 'import'], axis=1)
#c_matrix = X.corr()
y = example['import']
#w, v = np.linalg.eig(c_matrix)

import matplotlib.pyplot as pl
from sklearn import linear_model

###############################################################################
# Compute paths

alphas = [0.000, 0.001, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.018,
          0.020, 0.022, 0.024, 0.026, 0.028, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080,
          0.090, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
clf = linear_model.Ridge(fit_intercept=False)
clf2 = linear_model.Ridge(fit_intercept=False)
coefs = []
vif_list = [[] for x in range(X.shape[1])]
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

    for j, data in enumerate(X.columns):
        cols = [col for col in X.columns if col not in [data]]
        Z = X[cols]
        yy = X.iloc[:,j]
        clf2.set_params(alpha=a)
        clf2.fit(Z, yy)

        r_squared_j = clf2.score(Z, yy)
        vif = 1. / (1. - r_squared_j)
        print(r_squared_j)
        vif_list[j].append(vif)

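# Rows are indexed by alpha, columns by predictor.
vif_table = pd.DataFrame(vif_list, columns=alphas).T
coef_table = pd.DataFrame(coefs, index=alphas)
print(vif_table)
print(coef_table)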

###############################################################################
# Display results

ax = pl.gca()
# set_color_cycle was removed from matplotlib; set_prop_cycle replaces it.
ax.set_prop_cycle(color=['b', 'r', 'g', 'c', 'k', 'y', 'm'])

ax.plot(alphas, coefs)
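# ridge_cv is not defined anywhere above; assumed here to be a RidgeCV fit
# over the positive alphas (RidgeCV does not accept alpha = 0).
ridge_cv = linear_model.RidgeCV(alphas=alphas[1:], fit_intercept=False).fit(X, y)
pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyles='dashdot')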
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
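
One definition commonly used for the ridge VIF takes the diagonal of (R + kI)^-1 R (R + kI)^-1, where R is the correlation matrix of the predictors. The sketch below assumes that definition; whether it is exactly what the book's table or the SAS output uses is an assumption, and the helper name `ridge_vif` and the synthetic data are hypothetical. At k = 0 the expression reduces to the diagonal of R^-1, which is the OLS VIF, so agreement at k = 0.000 alone does not validate the values for larger k.

```python
import numpy as np

def ridge_vif(corr, k):
    """Diagonal of (R + kI)^-1 R (R + kI)^-1 for a correlation matrix R.

    Assumed definition of the ridge VIF; hypothetical helper, not a library API.
    """
    corr = np.asarray(corr, dtype=float)
    m_inv = np.linalg.inv(corr + k * np.eye(corr.shape[0]))
    return np.diag(m_inv @ corr @ m_inv)

# Synthetic, deliberately correlated predictors (illustrative data only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=200)
x3 = rng.normal(size=200)
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

ks = [0.0, 0.01, 0.05, 0.1, 0.5]
vif_path = np.array([ridge_vif(R, k) for k in ks])
for k, row in zip(ks, vif_path):
    print(k, np.round(row, 3))
```

With the data in the question, the call would look like `ridge_vif(X.corr().values, a)`. Note that k here is on the correlation scale. If the predictors are standardized with `StandardScaler`, then X'X equals n times R, so scikit-learn's `alpha` would correspond to roughly n * k rather than k, which may be one source of the mismatch.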

