Iterating OLS models with Python pandas and statsmodels runs very slowly (possibly due to misusing DataFrames)?

Posted 2024-05-16 11:47:09


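I am using statsmodels and pandas to automate an iterative process of fitting linear regressions over many combinations of variables. There are 697,343 combinations in total. That is a lot of OLS fits, but I did not expect it to take this long (more than 1 hour). X is at most 18x18, and Y is always 18x1.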

Can anyone tell me whether the code I am using is unoptimized? Is there a possible solution?

import time
import pandas
import statsmodels.api as sm
# Raw strings keep the Windows backslashes from being read as escape sequences
perm = pandas.read_pickle(r'C:\SharedData\Temp\ResultTestDataframes\perm')
BB = pandas.read_pickle(r'C:\SharedData\Temp\ResultTestDataframes\BB')
wdb_demog=pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/wdb_demog")
wdb_hts=pandas.read_pickle("C:/SharedData/Temp/ResultTestDataframes/wdb_hts")

# Empty result table; one row is appended per fitted model
result_db = pandas.DataFrame(columns=('R-squared value','Adj. R-squared','F-statistic','Prob (F-statistic)','coefficeints','Variables'))
row = -1
for v in range(len(perm)):
    row += 1
    # Unique variable names for this combination, without the None padding
    variables_columns = list(set(perm.loc[v]))
    if None in variables_columns:
        variables_columns.remove(None)
    # Build X (exogenous) and Y (endogenous) as nested lists from BB
    X = pandas.DataFrame(BB[variables_columns]).values.tolist()
    Y = pandas.DataFrame(BB[wdb_hts.columns.values[1]]).values.tolist()
    model = sm.OLS(Y, X)
    results = model.fit()
    R = [round(results.rsquared, 4),
         round(results.rsquared_adj, 4),
         round(results.fvalue, 4),
         round(results.f_pvalue, 4),
         list(results.params),
         list(variables_columns)]
    # Row-by-row enlargement of the result DataFrame
    result_db.loc[row] = pandas.Series(R, index=result_db.columns)

result_db.to_pickle("C:/SharedData/Temp/ResultTestDataframes/TEST")
print("done! " + time.strftime("%c"))

--------------------

# BB is a DataFrame (18 rows × 90 columns).
# perm is a DataFrame (697343 × 17) holding all the combinations of variables. X (the exogenous variables) is built from the given combination of variables and the data in BB.
# wdb_hts is another DataFrame, used to read the variable name from which Y (the endogenous variable) is constructed.
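
One likely cost in the loop above is that `result_db.loc[row] = ...` enlarges the DataFrame one row at a time, and that X and Y are rebuilt through intermediate DataFrames and nested lists on every pass. The following is only a minimal sketch of an alternative loop structure, not a drop-in replacement: it swaps `sm.OLS` for `numpy.linalg.lstsq`, computes only the coefficients and the uncentered R-squared (the definition statsmodels uses when X has no constant column), and leaves out the adjusted R-squared and F-statistic. The function name `fit_all` and the `demo_*` data are hypothetical; the real `BB`, `perm` and `wdb_hts` pickles are not available here, so synthetic data stands in for them.

```python
import numpy as np
import pandas as pd


def fit_all(BB, perm, y_col):
    """Fit one no-intercept least-squares model per row of `perm`.

    Hypothetical helper: collects results in a plain list and builds
    the result DataFrame once, after the loop.
    """
    y = BB[y_col].to_numpy(dtype=float)   # extracted once, outside the loop
    total_ss = float(y @ y)               # uncentered total sum of squares
    rows = []
    for combo in perm.itertuples(index=False, name=None):
        # Drop missing entries (None/NaN padding) and duplicate names, keeping order
        cols = [c for c in dict.fromkeys(combo) if pd.notna(c)]
        X = BB[cols].to_numpy(dtype=float)
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - float(resid @ resid) / total_ss
        rows.append((round(r2, 4), beta.tolist(), cols))
    return pd.DataFrame(rows, columns=["R-squared value", "coefficients", "Variables"])


# Synthetic stand-in data (assumption): 18 observations, y = 2*x1 - x2 exactly
rng = np.random.default_rng(0)
demo_BB = pd.DataFrame(rng.normal(size=(18, 3)), columns=["x1", "x2", "x3"])
demo_BB["y"] = 2.0 * demo_BB["x1"] - demo_BB["x2"]
demo_perm = pd.DataFrame([["x1", "x2"], ["x3", None]], columns=["v1", "v2"])

out = fit_all(demo_BB, demo_perm, "y")
print(out)
```

With the real data, the call would presumably look like `fit_all(BB, perm, wdb_hts.columns.values[1])`. If the full statsmodels output is required, the same list-then-DataFrame structure should still apply with `sm.OLS(y, X).fit()` inside the loop; how much time each change saves would need to be measured with a profiler on the actual data.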
