How to drop insignificant categorical interaction terms in Python statsmodels

Posted 2024-06-16 13:21:54


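In statsmodels, it is easy to add interaction terms. However, not all of the interactions are significant. My question is how to drop the insignificant ones, for example the airports interaction for Kootenay.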

# -*- coding: utf-8 -*-
import pandas as pd
import statsmodels.formula.api as sm


if __name__ == "__main__":

    # Read data
    census_subdivision_without_lower_mainland_and_van_island = pd.read_csv('../data/augmented/census_subdivision_without_lower_mainland_and_van_island.csv')

    # Fit all data
    fit = sm.ols(formula="instagram_posts ~ airports * C(CNMCRGNNM) + ports_and_ferry_terminals + railway_stations + accommodations + visitor_centers + festivals + attractions + C(CNMCRGNNM) + C(CNSSSBDVS3)", data=census_subdivision_without_lower_mainland_and_van_island).fit()
    print(fit.summary())



2 Answers

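I tried to recreate some data, focusing on the variables in the interaction. I am not sure whether the goal is just to get the values or whether you need a specific format, but here is an example of how to solve this with pandas (since you import pandas in the original post):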

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
np.random.seed(2)

df = pd.DataFrame()
df['instagram_posts'] = np.random.rand(50)
df['airports'] = np.random.rand(50)
df['CNMCRGNNM'] = np.random.choice(['Kootenay','Nechako','North Coast','Northeast','Thompson-Okanagan'],50)

fit = sm.ols(formula="instagram_posts ~ airports * C(CNMCRGNNM)",data=df).fit()
print(fit.summary())

This prints the OLS regression summary table (the output itself is not reproduced here).

Change alpha to your preferred significance level:

alpha = 0.05
df = pd.DataFrame(data = [x for x in fit.summary().tables[1].data[1:] if float(x[4]) < alpha], columns = fit.summary().tables[1].data[0])

The DataFrame df holds the records from the original table that are significant at alpha. In this case, those are Intercept and airports:C(CNMCRGNNM)[T.Nechako].
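As an alternative to parsing the printed summary table, the fitted results object exposes the p-values directly through its pvalues attribute, which is a pandas Series when the model is fitted from a formula. A minimal sketch, assuming synthetic data similar to the example above (the data and the names pvalues and significant_terms are illustrative, not part of the original post):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; the regions are tiled so that
# every category is guaranteed to appear.
rng = np.random.default_rng(2)
regions = ["Kootenay", "Nechako", "North Coast", "Northeast", "Thompson-Okanagan"]
df = pd.DataFrame({
    "instagram_posts": rng.random(50),
    "airports": rng.random(50),
    "CNMCRGNNM": np.tile(regions, 10),
})

fit = smf.ols("instagram_posts ~ airports * C(CNMCRGNNM)", data=df).fit()

alpha = 0.05
# fit.pvalues is a pandas Series indexed by term name.
pvalues = fit.pvalues
# Boolean indexing keeps only the terms significant at alpha.
significant_terms = pvalues[pvalues < alpha]
print(significant_terms)
```

This avoids converting the formatted strings in the summary table back to floats, so the comparison uses the unrounded p-values.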

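You may also want to consider removing these features one at a time (starting with the most insignificant one). This is because a feature can become significant depending on the absence or presence of another feature. The code below does this for you (I assume you have already defined X and y):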

import statsmodels.api as sm
import pandas as pd

def remove_most_insignificant(df, results):
    # idxmax returns the label of the feature with the largest p-value
    # (Series.iteritems() was removed in pandas 2.0):
    max_p_value = results.pvalues.idxmax()
    # this is the feature you want to drop:
    df.drop(columns = max_p_value, inplace = True)
    return df

insignificant_feature = True
while insignificant_feature:
    model = sm.OLS(y, X)
    results = model.fit()
    significant = [p_value < 0.05 for p_value in results.pvalues]
    if all(significant):
        insignificant_feature = False
    else:
        if X.shape[1] == 1:  # the only remaining feature is insignificant
            print('No significant features found')
            results = None
            insignificant_feature = False
        else:
            X = remove_most_insignificant(X, results)

# results is None when no significant feature was found
if results is not None:
    print(results.summary())
