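Hi everyone,
I'm running a multiple linear regression in Python and need to demonstrate multicollinearity. I'm more used to working in R, and I'm wondering whether there is a function or method that replicates the alias function from R's stats package. What I'm after is an output that shows how the confounded variables are related.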
R's stats::alias() gives output like this:
rail_trail_head <- mosaicData::RailTrail %>% head(10)
alias(volume ~ ., rail_trail_head)
Output:
Model :
volume ~ hightemp + lowtemp + avgtemp + spring + summer + fall +
cloudcover + precip + weekday + dayType
Complete :
         (Intercept) hightemp lowtemp spring cloudcover precip weekday1
avgtemp            0      1/2     1/2      0          0      0        0
summer             1        0       0     -1          0      0        0
fall               0        0       0      0          0      0        0
dayType1           0        0       0      0          0      0       -1
For example, this clearly shows that avgtemp can be computed from hightemp and lowtemp.
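The two non-trivial rows of the Complete table can be checked by hand against the ten rows of data given in this post. A minimal check in Python, assuming NumPy is available (the variable names `avgtemp_ok` and `summer_ok` are only for illustration):

```python
import numpy as np

hightemp = np.array([83, 73, 74, 95, 44, 69, 66, 66, 80, 79])
lowtemp = np.array([50, 49, 52, 61, 52, 54, 39, 38, 55, 45])
avgtemp = np.array([66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62])
spring = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])
summer = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 1])

# Row "avgtemp": 1/2 * hightemp + 1/2 * lowtemp
avgtemp_ok = np.allclose(avgtemp, 0.5 * hightemp + 0.5 * lowtemp)

# Row "summer": 1 * (Intercept) - 1 * spring
summer_ok = np.array_equal(summer, 1 - spring)

print(avgtemp_ok, summer_ok)  # True True
```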
The closest thing I've found in Python is variance_inflation_factor from statsmodels:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# data
rail_trail_head = pd.DataFrame({
    'hightemp': [83, 73, 74, 95, 44, 69, 66, 66, 80, 79],
    'lowtemp': [50, 49, 52, 61, 52, 54, 39, 38, 55, 45],
    'avgtemp': [66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62],
    'spring': [0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    'summer': [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    'fall': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'cloudcover': [7.59999990463257,
                   6.30000019073486, 7.5, 2.59999990463257, 10, 6.59999990463257,
                   2.40000009536743, 0, 3.79999995231628,
                   4.09999990463257],
    'precip': [0, 0.28999999165535,
               0.319999992847443, 0, 0.140000000596046,
               0.0199999995529652, 0, 0, 0, 0],
    'volume': [501, 419, 397, 385, 200, 375, 417, 629, 533, 547],
    'weekday': [True, True, True, False,
                True, True, True, False, False, True],
    'dayType': ["weekday", "weekday",
                "weekday", "weekend", "weekday", "weekday", "weekday",
                "weekend", "weekend", "weekday"]
})
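# design matrix: drop the response and one-hot encode dayType
X = pd.get_dummies(
    rail_trail_head
    .drop(columns='volume')
    .assign(weekday=rail_trail_head.weekday.astype('int')),
    dtype=float,  # numeric dummies, so X.values is a plain float array
)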
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(len(X.columns))]
print(vif_data)
Output:
            feature       VIF
0          hightemp       inf
1           lowtemp       inf
2           avgtemp       inf
3            spring       inf
4            summer       inf
5              fall       NaN
6        cloudcover  9.999401
7            precip  1.446440
8           weekday       inf
9   dayType_weekday       inf
10  dayType_weekend       inf
While this is informative, it doesn't tell me which variables are related to each other, or what the relationship between them is.
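For context, the textbook definition is VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the remaining columns. An exact linear dependency gives R_i^2 = 1 and therefore an infinite VIF, while the coefficients of that dependency are thrown away. An all-zero column such as fall turns R_i^2 into a 0/0 expression, which is presumably where the NaN comes from. A rough NumPy sketch of the definition follows; the helper name `vif`, the added intercept column and the reduced set of columns are my own choices, so its numbers need not match the statsmodels output above.

```python
import numpy as np

def vif(X, i):
    """Textbook VIF of column i: 1 / (1 - R^2) from regressing it on the rest."""
    y = X[:, i]
    # Assumption: the auxiliary regression includes an intercept column.
    others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    with np.errstate(divide="ignore"):  # R^2 == 1 gives inf, not an error
        return 1.0 / (1.0 - r2)

hightemp = [83, 73, 74, 95, 44, 69, 66, 66, 80, 79]
lowtemp = [50, 49, 52, 61, 52, 54, 39, 38, 55, 45]
avgtemp = [66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62]
cloudcover = [7.59999990463257, 6.30000019073486, 7.5, 2.59999990463257, 10,
              6.59999990463257, 2.40000009536743, 0, 3.79999995231628,
              4.09999990463257]
X = np.column_stack([hightemp, lowtemp, avgtemp, cloudcover]).astype(float)

for name, i in [("avgtemp", 2), ("cloudcover", 3)]:
    print(name, vif(X, i))
```

The value for avgtemp is infinite or astronomically large, but nothing in it says that hightemp and lowtemp are the columns responsible.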
Is there a Python equivalent of the alias() function from R's stats package?
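One possible approach, sketched here under assumptions rather than offered as a drop-in replacement: scan the design-matrix columns from left to right, keep a column if it is not a linear combination of the columns kept so far, and otherwise record the least-squares coefficients that reproduce it. This is similar in spirit to what the Complete table shows. The function name `alias_table`, the tolerance and the 0/1 coding of dayType are all hypothetical choices of mine; R uses different contrasts here (weekday1, dayType1), so row names and signs can differ from the R output.

```python
import numpy as np
import pandas as pd

def alias_table(X, tol=1e-8):
    """Express each linearly dependent column in terms of earlier kept columns."""
    kept, aliased = [], {}
    for name in X.columns:
        y = X[name].to_numpy(dtype=float)
        if kept:
            A = X[kept].to_numpy(dtype=float)
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
        else:
            coef, resid = np.zeros(0), y
        if np.linalg.norm(resid) > tol * max(1.0, np.linalg.norm(y)):
            kept.append(name)  # adds new information: keep it
        else:
            aliased[name] = pd.Series(coef, index=kept)  # exact combination
    table = pd.DataFrame(aliased).T.reindex(columns=kept).fillna(0.0)
    return table.round(6) + 0.0  # "+ 0.0" turns -0.0 into 0.0 for display

# First ten rows of RailTrail as given in this post (volume omitted).
df = pd.DataFrame({
    "hightemp": [83, 73, 74, 95, 44, 69, 66, 66, 80, 79],
    "lowtemp": [50, 49, 52, 61, 52, 54, 39, 38, 55, 45],
    "avgtemp": [66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62],
    "spring": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    "summer": [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    "fall": [0] * 10,
    "cloudcover": [7.59999990463257, 6.30000019073486, 7.5, 2.59999990463257,
                   10, 6.59999990463257, 2.40000009536743, 0,
                   3.79999995231628, 4.09999990463257],
    "precip": [0, 0.28999999165535, 0.319999992847443, 0, 0.140000000596046,
               0.0199999995529652, 0, 0, 0, 0],
    "weekday": [True, True, True, False, True, True, True, False, False, True],
    "dayType": ["weekday", "weekday", "weekday", "weekend", "weekday",
                "weekday", "weekday", "weekend", "weekend", "weekday"],
})

# Assumed coding: explicit intercept plus one 0/1 dummy for dayType == "weekend".
X = df.drop(columns="dayType").astype(float)
X.insert(0, "(Intercept)", 1.0)
X["dayType_weekend"] = (df["dayType"] == "weekend").astype(float)

table = alias_table(X)
print(table)
```

With this coding the avgtemp row should carry 0.5 under hightemp and lowtemp, and the summer row 1 under (Intercept) and -1 under spring, as in the R output. The result depends on column order, because a dependent column is always expressed through columns to its left.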
# if you don't want to install.packages('mosaicData'):
rail_trail_head <- data.frame(
  stringsAsFactors = FALSE,
  hightemp = c(83, 73, 74, 95, 44, 69, 66, 66, 80, 79),
  lowtemp = c(50, 49, 52, 61, 52, 54, 39, 38, 55, 45),
  avgtemp = c(66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62),
  spring = c(0, 0, 1, 0, 1, 1, 1, 1, 0, 0),
  summer = c(1, 1, 0, 1, 0, 0, 0, 0, 1, 1),
  fall = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  cloudcover = c(7.59999990463257,
                 6.30000019073486,7.5,2.59999990463257,10,6.59999990463257,
                 2.40000009536743,0,3.79999995231628,
                 4.09999990463257),
  precip = c(0,0.28999999165535,
             0.319999992847443,0,0.140000000596046,
             0.0199999995529652,0,0,0,0),
  volume = c(501, 419, 397, 385, 200, 375, 417, 629, 533, 547),
  weekday = c(TRUE,TRUE,TRUE,FALSE,
              TRUE,TRUE,TRUE,FALSE,FALSE,TRUE),
  dayType = c("weekday","weekday",
              "weekday","weekend","weekday","weekday","weekday",
              "weekend","weekend","weekday")
)
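I've realized that I don't need the full formula relating the collinear variables, only the fact that they are related to each other. So one extra column, "group", would be enough: it would indicate that the variables in the same group are collinear.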
In the railtrail example, what I need is output similar to the following:
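A sketch of how such a group column might be produced. This is my own illustration under the assumptions noted in the comments, not the output of an existing library function. It finds exact dependencies with a left-to-right least-squares scan, links each dependent column to the columns it is built from, and merges overlapping links into groups. The name `collinear_groups`, the tolerances, the explicit intercept and the 0/1 coding of dayType are hypothetical; group 0 means the column takes part in no exact dependency.

```python
import numpy as np
import pandas as pd

def collinear_groups(X, intercept="(Intercept)", tol=1e-8):
    """Label columns that take part in the same exact linear dependency."""
    kept, links = [], {}
    for name in X.columns:
        y = X[name].to_numpy(dtype=float)
        if kept:
            A = X[kept].to_numpy(dtype=float)
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
        else:
            coef, resid = np.zeros(0), y
        if np.linalg.norm(resid) > tol * max(1.0, np.linalg.norm(y)):
            kept.append(name)
        else:
            # Partners: kept columns with a non-negligible coefficient.
            # The intercept is skipped so it does not glue unrelated groups.
            links[name] = [k for k, c in zip(kept, coef)
                           if abs(c) > 1e-6 and k != intercept]
    groups = []
    for name, partners in links.items():
        members = {name, *partners}
        for g in [g for g in groups if g & members]:
            members |= g       # merge groups that share a column
            groups.remove(g)
        groups.append(members)
    label = {col: i for i, g in enumerate(groups, start=1) for col in g}
    cols = [c for c in X.columns if c != intercept]
    return pd.DataFrame({"feature": cols,
                         "group": [label.get(c, 0) for c in cols]})

# First ten rows of RailTrail as given in this post (volume omitted).
df = pd.DataFrame({
    "hightemp": [83, 73, 74, 95, 44, 69, 66, 66, 80, 79],
    "lowtemp": [50, 49, 52, 61, 52, 54, 39, 38, 55, 45],
    "avgtemp": [66.5, 61, 63, 78, 48, 61.5, 52.5, 52, 67.5, 62],
    "spring": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    "summer": [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    "fall": [0] * 10,
    "cloudcover": [7.59999990463257, 6.30000019073486, 7.5, 2.59999990463257,
                   10, 6.59999990463257, 2.40000009536743, 0,
                   3.79999995231628, 4.09999990463257],
    "precip": [0, 0.28999999165535, 0.319999992847443, 0, 0.140000000596046,
               0.0199999995529652, 0, 0, 0, 0],
    "weekday": [True, True, True, False, True, True, True, False, False, True],
    "dayType": ["weekday", "weekday", "weekday", "weekend", "weekday",
                "weekday", "weekday", "weekend", "weekend", "weekday"],
})
X = df.drop(columns="dayType").astype(float)
X.insert(0, "(Intercept)", 1.0)
X["dayType_weekend"] = (df["dayType"] == "weekend").astype(float)

groups_df = collinear_groups(X)
print(groups_df)
```

With this coding the three temperature columns should share one group, spring and summer another, and weekday and dayType_weekend a third. The all-zero fall column ends up in a group of its own, and cloudcover and precip get group 0.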