Is there a Python equivalent of R's stats::alias() function?

Posted 2024-04-28 11:44:52

Question:

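Hi all,

I'm doing a multiple linear regression in Python and need to demonstrate multicollinearity. I'm more used to working in R, and I'd like to know whether there is a function or method that replicates R's stats::alias(). What I want is output that shows how the confounded variables are related to one another.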

R

R's stats::alias() gives output like this:

rail_trail_head <- head(mosaicData::RailTrail, 10)

alias(volume ~ ., rail_trail_head)

Output:

Model :
volume ~ hightemp + lowtemp + avgtemp + spring + summer + fall + 
    cloudcover + precip + weekday + dayType

Complete :
         (Intercept) hightemp lowtemp spring cloudcover precip weekday1
avgtemp    0         1/2      1/2       0      0          0      0     
summer     1           0        0      -1      0          0      0     
fall       0           0        0       0      0          0      0     
dayType1   0           0        0       0      0          0     -1  

For example, this clearly shows that avgtemp can be computed from hightemp and lowtemp.

Python

The closest thing I've found in Python comes from statsmodels:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# data
rail_trail_head = pd.DataFrame({
    'hightemp' : [83, 73, 74, 95, 44, 69, 66, 66, 80, 79],
    'lowtemp' : [50, 49, 52, 61, 52, 54, 39, 38, 55, 45],
    'avgtemp' : [66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62],
    'spring' : [0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    'summer' : [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    'fall' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'cloudcover' : [7.59999990463257,
                          6.30000019073486,7.5,2.59999990463257,10,6.59999990463257,
                          2.40000009536743,0,3.79999995231628,
                          4.09999990463257],
    'precip' : [0,0.28999999165535,
                          0.319999992847443,0,0.140000000596046,
                          0.0199999995529652,0,0,0,0],
    'volume' : [501, 419, 397, 385, 200, 375, 417, 629, 533, 547],
    'weekday' : [True,True,True,False,
                          True,True,True,False,False,True],
    'dayType' : ["weekday","weekday",
                          "weekday","weekend","weekday","weekday","weekday",
                          "weekend","weekend","weekday"]
})

X = pd.get_dummies(
    rail_trail_head
    .drop(columns='volume')
    .assign(weekday=rail_trail_head.weekday.astype('int')),
    dtype=float,  # numeric dummies, so X.values is a float array rather than object
)

# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
  
print(vif_data)

Out:

            feature       VIF
0          hightemp       inf
1           lowtemp       inf
2           avgtemp       inf
3            spring       inf
4            summer       inf
5              fall       NaN
6        cloudcover  9.999401
7            precip  1.446440
8           weekday       inf
9   dayType_weekday       inf
10  dayType_weekend       inf

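While this is informative, it doesn't tell me which variables are related to each other, or how.

Is there a Python equivalent of the alias() function from R's stats package?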
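One possible approach, sketched here with plain NumPy rather than any statsmodels or pandas built-in, is to scan the design matrix column by column: a column that does not raise the rank is "aliased", and least squares then gives the coefficients that reproduce it from the columns kept so far. The function name `find_aliases` and the `tol` parameter are hypothetical, the intercept column is added by hand, and only a subset of the RailTrail columns is used to keep the sketch short.

```python
import numpy as np
import pandas as pd


def find_aliases(X, tol=1e-8):
    """Return coefficients expressing each aliased column via the retained ones.

    Hypothetical helper: columns are scanned left to right, so which member
    of a collinear set is reported as aliased depends on column order.
    """
    A = X.to_numpy(dtype=float)
    kept = []     # positions of columns that add new information
    aliased = []  # positions of columns that are linear combinations of `kept`
    for j in range(A.shape[1]):
        if np.linalg.matrix_rank(A[:, kept + [j]]) > len(kept):
            kept.append(j)
        else:
            aliased.append(j)
    if not aliased:
        return pd.DataFrame(columns=X.columns[kept])
    # Solve A[:, kept] @ coefs = A[:, aliased] in the least-squares sense.
    coefs, *_ = np.linalg.lstsq(A[:, kept], A[:, aliased], rcond=None)
    coefs[np.abs(coefs) < tol] = 0.0  # suppress floating-point noise
    return pd.DataFrame(coefs.T, index=X.columns[aliased], columns=X.columns[kept])


# Subset of the data from the question, with an explicit intercept column.
X = pd.DataFrame({
    "Intercept": 1.0,
    "hightemp": [83, 73, 74, 95, 44, 69, 66, 66, 80, 79],
    "lowtemp": [50, 49, 52, 61, 52, 54, 39, 38, 55, 45],
    "avgtemp": [66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62],
    "spring": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    "summer": [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    "fall": [0] * 10,
})

print(find_aliases(X).round(3))
```

On this subset the avgtemp row should come out as 0.5 * hightemp + 0.5 * lowtemp and the summer row as Intercept - spring, which is the same relationship the R output reports. An all-zero row, as for fall, means the column itself is constant zero.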

The mosaicData data in R:

# if you don't want to install.packages('mosaicData'):
rail_trail_head <- data.frame(
  stringsAsFactors = FALSE,
          hightemp = c(83, 73, 74, 95, 44, 69, 66, 66, 80, 79),
           lowtemp = c(50, 49, 52, 61, 52, 54, 39, 38, 55, 45),
           avgtemp = c(66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62),
            spring = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 0),
            summer = c(1, 1, 0, 1, 0, 0, 0, 0, 1, 1),
              fall = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
           cloudcover = c(7.59999990463257,
                          6.30000019073486,7.5,2.59999990463257,10,6.59999990463257,
                          2.40000009536743,0,3.79999995231628,
                          4.09999990463257),
               precip = c(0,0.28999999165535,
                          0.319999992847443,0,0.140000000596046,
                          0.0199999995529652,0,0,0,0),
            volume = c(501, 419, 397, 385, 200, 375, 417, 629, 533, 547),
              weekday = c(TRUE,TRUE,TRUE,FALSE,
                          TRUE,TRUE,TRUE,FALSE,FALSE,TRUE),
              dayType = c("weekday","weekday",
                          "weekday","weekend","weekday","weekday","weekday",
                          "weekend","weekend","weekday")
   )

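Edit

I've realized that I don't need the full formula describing how the related variables depend on each other, only the fact that they are related. So one extra column, "group", would be enough: it would indicate that the variables in a given group are collinear.

For the RailTrail example, what I need is output along these lines: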

(table not preserved in the source)
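A "group" column of this kind could be derived from an alias-style coefficient table. The sketch below is only an illustration: the table is typed in by hand from the R output in the question, and `collinear_groups` is a hypothetical helper, not a library function. Each aliased term is grouped with the terms that have a non-zero coefficient in its row, and groups that share a variable are merged.

```python
import pandas as pd

# Alias table copied by hand from the R output in the question
# (rows: aliased terms, columns: retained terms).
alias = pd.DataFrame(
    [[0, 0.5, 0.5, 0, 0, 0, 0],
     [1, 0, 0, -1, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, -1]],
    index=["avgtemp", "summer", "fall", "dayType1"],
    columns=["(Intercept)", "hightemp", "lowtemp", "spring",
             "cloudcover", "precip", "weekday1"],
)


def collinear_groups(alias):
    """Label each variable with a group id; variables sharing an id are linearly tied."""
    groups = []  # list of disjoint sets of variable names
    for term, row in alias.iterrows():
        members = {term} | set(row[row != 0].index)
        # Merge with any existing group that already contains one of these variables.
        for g in [g for g in groups if g & members]:
            members |= g
            groups.remove(g)
        groups.append(members)
    labels = {name: i for i, g in enumerate(groups, start=1) for name in g}
    return pd.Series(labels, name="group").sort_values()


print(collinear_groups(alias))
```

Variables that appear in no group (here cloudcover and precip) are not involved in any exact linear dependency. A group containing a single variable, such as fall, indicates a column that is constant zero.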
