计算学生删除残差和离群值检验()以预测OL

2024-06-16 13:51:25 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在用statmodel OLS进行迭代异常值消除。我已经把模型装上了

ols_result = sm.OLS(y,X).fit()

然后我可以得到外部和bonferroni的学生删除残差

ols_result.outlier_test(method="bonf")

我正在删除超过bonferroni p>;%的样品10在每次迭代中,厨师距离也是最高的。直到我没有得到bonf(p)>;%的样本10然后我得到了原始样本的子集

假设我有400个样本,剔除异常值后我得到380个样本。现在我想找到学生删除残差和bonferroni,关于回归拟合380个样本的400个样本。看看删除的异常值是否真的是异常值

问题就从这里开始。我在寻找一种简单的方法来使用statmodels-OLS模型来得到拟合值的残差和cooks距离,而不是自己编写那些函数。但是.outlier_test().get_influence()似乎适用于OLS结果对象

你们有什么简单的方法来实现这些测试而不需要太多的代码等

提前谢谢


Tags: 方法模型testgt距离result学生样本
1条回答
网友
1楼 · 发布于 2024-06-16 13:51:25

经过对这种实用性的深入研究。我找不到简单的方法来使用statmodels库。相反,我编写了自己的代码,也得到了statmodels源代码的帮助。如果有人需要我在下面分享的代码

def internally_studentized_residual(X,Y,y_hat):
    """

    Calculate studentized residuals internal

    Parameters
    ______________________________________________________________________

    X:            List
                  Variable x-axis

    Y:            List
                  Response variable of the X 


    y_hat:        List
                  Predictions of the response variable with the given X

    Returns
    ______________________________________________________________________

    Dataframe :   List
                  Studentized Residuals   


    """

#     print(len(Y))
    X = np.array(X, dtype=float)
    Y = np.array(Y, dtype=float)
    y_hat = np.array(y_hat,dtype=float)
    mean_X = np.mean(X)
    mean_Y = np.mean(Y)
    n = len(X)

#     print(X.shape,Y.shape,y_hat.shape)

    residuals = Y - y_hat

    X_inverse = np.linalg.pinv(X.reshape(-1,1))[0]
    h_ii = X_inverse.T * X

    Var_e = math.sqrt(sum((Y - y_hat) ** 2)/(n-2))
    SE_regression = Var_e/((1-h_ii) ** 0.5)
    studentized_residuals = residuals/SE_regression
    return studentized_residuals

def deleted_studentized_residual(X,Y,y_hat):
    """

    Calculate studentized residuals external

    Parameters
    ______________________________________________________________________

    X:            List
                  Variable x-axis

    Y:            List
                  Response variable of the X 


    y_hat:        List
                  Predictions of the response variable with the given X

    Returns
    ______________________________________________________________________

    Dataframe :   List
                  Studentized Residuals External  


    """
    #formula from https://newonlinecourses.science.psu.edu/stat501/node/401/
    r = internally_studentized_residual(X,Y,y_hat)
    n = len(r)
    return [r_i*math.sqrt((n-2-1)/(n-2-r_i**2)) for r_i in r]


def outlier_test(X,Y,y_hat,alpha=0.1):
    """

    outlier test for the points

    Parameters
    ______________________________________________________________________

    X:            List
                  Variable x-axis

    Y:            List
                  Response variable of the X 


    y_hat:        List
                  Predictions of the response variable with the given X

    alpha:        float
                  alpha value for multiple test

    Returns
    ______________________________________________________________________

    Dataframe :   studentized, unadjusted p values and benferroni multiple 
                  test dataframe  


    """
    resid = deleted_studentized_residual(X,Y,y_hat)
    df = len(X) - 1 
    p_vals = stats.t.sf(np.abs(resid),df) * 2
    bonf_test = multipletests(p_vals,alpha,method="bonf")
    df_result = pd.DataFrame()
    df_result.loc[:,"student_resid"] = resid
    df_result.loc[:,"unadj_p"] = p_vals
    df_result.loc[:,"bonf(p)"] = bonf_test[1]
    df_result.index = X.index

    return df_result

这可能是一个有点脏的代码。我没有时间去打扫和提高效率。它使用numpy模块和scipy.stats

相关问题 更多 >