Speeding up creation of a MultiIndex DataFrame

Posted 2024-04-20 12:19:52


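I have a DataFrame of the following form (9k entries); all the other columns, which are irrelevant here, are lumped together as other_cols: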

print(df_test[0:4])
   index  TF-Objektnummer Date                           other_cols
0      0          4259619 1970-01-01 13:45:41.000014557  xx
1      1          4186279 2014-10-20 06:42:23.000056098  yy
2      2          4185787 2014-10-16 06:18:56.000067086  zz
3      3          4259599 1970-01-01 13:03:59.000083584  kk

The date is either the Unix epoch with a random time (filled in by fillna) or the installation datetime.

What I want is a MultiIndex DataFrame whose level-0 index is TF-Objektnummer and whose level-1 index is a datetime range defined as follows (where inter = df_test.loc[ind] for a given ind):

minff=pd.to_datetime('2001-01-07 00:00:00')
maxMD=pd.to_datetime('2017-08-31 08:33:34.000057100')
if inter["Date"]< minff:
    start=minff
else:
    start=inter["Date"]
dt_index=pd.date_range(start=start,end=maxMD)

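In other words, for each TF-Objektnummer I want one row per date in the range from Date (or from minff, if Date is earlier) to maxMD, indexed by that date on level 1. I also want to add columns derived from the level-1 index, such as Year, Day and Season. The result should look like this: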

                                       index TF-Objektnummer Date                          other_cols Year Month Day DoW Season Hour
TF-objNr Date                                                                  
4259619  2001-01-07 00:00:00.000000000     0         4259619 1970-01-01 13:45:41.000014557 xx         2001     1   7   6      0    0  
         2001-01-08 00:00:00.000000000     0         4259619 1970-01-01 13:45:41.000014557 xx         2001     1   8   0      0    0
         ...
         ...
         2017-08-30 00:00:00.000000000     0         4259619 1970-01-01 13:45:41.000014557 xx         2017     8  30   2      2    0
         2017-08-31 00:00:00.000000000     0         4259619 1970-01-01 13:45:41.000014557 xx         2017     8  31   3      2    0
...
...
4185787  2014-10-16 06:18:56.000067086     2         4185787 2014-10-16 06:18:56.000067086 zz         2014    10  16   3      3     6
         2014-10-17 06:18:56.000067086     2         4185787 2014-10-16 06:18:56.000067086 zz         2014    10  17   4      3     6 
         ...
         ...
         2017-08-31 06:18:56.000067086     2         4185787 2014-10-16 06:18:56.000067086 zz         2017     8  31   3      2     6
...

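The code below does this, but it is far too slow (about 6 h for 100 distinct level-0 values, and I need to get through 9k). Here cols is just the list of the other columns, which are irrelevant to the question but still needed.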

kk=0
for ind in df_test.index:
    print("#############",kk-1)
    inter=df_test.loc[ind]

    if inter["Date"]< minff:
        start=minff
    else:
        start=inter["Date"]
    dt_index=pd.date_range(start=start,end=maxMD)
    ind0=inter['TF-Objektnummer']
    cc=0
    for indd in dt_index:
        inter2=pd.DataFrame(index= pd.MultiIndex.from_tuples([(ind0,indd)], names=['TF-objNr','Date']),columns=cols+["Year","Month","Day","DoW","Season","Hour"])
        if cc%523==12:
            print(cc)
        inter2.loc[ind0,indd][cols]=inter[cols]
        inter2.loc[ind0,indd]["Year"]=indd.year
        mm=indd.month
        inter2.loc[ind0,indd]["Month"]=mm
        inter2.loc[ind0,indd]["Day"]=indd.day        
        inter2.loc[ind0,indd]["DoW"]=indd.dayofweek        
        inter2.loc[ind0,indd]["Hour"]=indd.round("h").hour
        if mm in [12,1,2]:
            inter2.loc[ind0,pd.to_datetime(indd)]["Season"]=0
        elif mm in [3,4,5]:
            inter2.loc[ind0,pd.to_datetime(indd)]["Season"]=1
        elif mm in [6,7,8]:
            inter2.loc[ind0,pd.to_datetime(indd)]["Season"]=2
        else:
            inter2.loc[ind0,pd.to_datetime(indd)]["Season"]=3

        cc+=1
        if kk==0:
            df_timeser=inter2.copy()
            kk+=1
        else:
            df_timeser=df_timeser.append(inter2)
    if kk%500==100:
        df_timeser.to_csv("/myDir/df_timeser"+str(kk)+"_"+str(ind0)+".csv",index=True)
    kk+=1
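
A likely reason for the slowness is that the loop builds a one-row DataFrame per date, fills it cell by cell, and then grows df_timeser with append, which copies everything accumulated so far on each call (DataFrame.append was also removed in pandas 2.0). A minimal sketch of the usual workaround, keeping the per-object loop but building each object's rows in one step and concatenating once at the end. The stand-in frame df_demo and the helper name expand_row are assumptions made for illustration; they are not part of the original code.

```python
import pandas as pd

minff = pd.to_datetime("2001-01-07 00:00:00")
maxMD = pd.to_datetime("2017-08-31 08:33:34.000057100")

# Stand-in for df_test (assumption: two of the sample rows shown earlier).
df_demo = pd.DataFrame({
    "TF-Objektnummer": [4259619, 4185787],
    "Date": pd.to_datetime(["1970-01-01 13:45:41.000014557",
                            "2014-10-16 06:18:56.000067086"]),
    "other_cols": ["xx", "zz"],
})
cols = ["TF-Objektnummer", "Date", "other_cols"]


def expand_row(row):
    """Hypothetical helper: build all daily rows for one object at once."""
    start = max(row["Date"], minff)          # same rule as the if/else above
    dates = pd.date_range(start=start, end=maxMD)
    idx = pd.MultiIndex.from_product(
        [[row["TF-Objektnummer"]], dates], names=["TF-objNr", "Date"]
    )
    # Scalars are broadcast to every row of the new index.
    out = pd.DataFrame({c: row[c] for c in cols}, index=idx)
    # Calendar columns come from the whole DatetimeIndex, not cell by cell.
    out["Year"] = dates.year
    out["Month"] = dates.month
    out["Day"] = dates.day
    out["DoW"] = dates.dayofweek
    out["Hour"] = dates.round("h").hour
    # Dec-Feb -> 0, Mar-May -> 1, Jun-Aug -> 2, Sep-Nov -> 3
    out["Season"] = (dates.month % 12 + 3) // 3 - 1
    return out


# Collect the pieces in a list and concatenate a single time.
pieces = [expand_row(row) for _, row in df_demo.iterrows()]
df_timeser = pd.concat(pieces)
```

With 9k objects this still runs a Python-level loop over objects, but no longer over individual dates, and the accumulated frame is never copied inside the loop.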

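In case anyone is wondering: I want to build a better forecasting model and enrich it with weather forecasts, so I need one entry per day (the location is given by TF-Objektnummer).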


1 answer
User
#1 · Posted 2024-04-20 12:19:52

I think you can use:

from itertools import product
import pandas as pd

minff=pd.to_datetime('2001-01-07 00:00:00')
maxMD=pd.to_datetime('2017-08-31 08:33:34.000057100')

#create (TF-Objektnummer, start) tuples, replacing dates before minff with minff
tup = list(zip(df["TF-Objektnummer"].tolist(), 
               df["Date"].mask(df["Date"] < minff, minff).tolist()))

print (tup)
[(4259619, Timestamp('2001-01-07 00:00:00')), 
 (4186279, Timestamp('2014-10-20 06:42:23.000056098')), 
 (4185787, Timestamp('2014-10-16 06:18:56.000067086')), 
 (4259599, Timestamp('2001-01-07 00:00:00'))]

#create product of date ranges and flatten output
tup1 =  [i for a, b in tup for i in list(product([a], pd.date_range(start=b,end=maxMD)))]
#final MultiIndex
mux = pd.MultiIndex.from_tuples(tup1, names=['TF-objNr','Date'])

#reindex by MultiIndex
df = df.set_index('TF-Objektnummer').reindex(mux, level=0)

#add new columns
idd = df.index.get_level_values(1)
df['Year'] = idd.year
df['Month'] = idd.month
df['Day'] = idd.day
df['DoW'] = idd.dayofweek
df['Hour'] = idd.round("h").hour
df["Season"] = (df['Month'] % 12 + 3) // 3 - 1

print (df.head())
                     index                          Date other_cols  Year  \
TF-objNr Date                                                               
4259619  2001-01-07      0 1970-01-01 13:45:41.000014557         xx  2001   
         2001-01-08      0 1970-01-01 13:45:41.000014557         xx  2001   
         2001-01-09      0 1970-01-01 13:45:41.000014557         xx  2001   
         2001-01-10      0 1970-01-01 13:45:41.000014557         xx  2001   
         2001-01-11      0 1970-01-01 13:45:41.000014557         xx  2001   

                     Month  Day  DoW  Hour  Season  
TF-objNr Date                                       
4259619  2001-01-07      1    7    6     0       0  
         2001-01-08      1    8    0     0       0  
         2001-01-09      1    9    1     0       0  
         2001-01-10      1   10    2     0       0  
         2001-01-11      1   11    3     0       0  
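
A variation on the same idea, sketched under assumptions rather than taken from the answer above: instead of materialising one Python tuple per output row, repeat each source row as many times as its date range is long and attach the concatenated dates as the second index level. The frame df_demo is a stand-in for the real data, and counts, dates and out are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

minff = pd.to_datetime("2001-01-07 00:00:00")
maxMD = pd.to_datetime("2017-08-31 08:33:34.000057100")

# Stand-in data (assumption: same column names as the sample in the question).
df_demo = pd.DataFrame({
    "TF-Objektnummer": [4259619, 4186279, 4185787],
    "Date": pd.to_datetime(["1970-01-01 13:45:41.000014557",
                            "2014-10-20 06:42:23.000056098",
                            "2014-10-16 06:18:56.000067086"]),
    "other_cols": ["xx", "yy", "zz"],
})

# Clamp the start dates, then build one daily range per source row.
start = df_demo["Date"].mask(df_demo["Date"] < minff, minff)
ranges = [pd.date_range(start=s, end=maxMD) for s in start]
counts = [len(r) for r in ranges]

# Repeat each source row once per date in its range.
out = df_demo.loc[df_demo.index.repeat(counts)]

# Concatenate all ranges and use them as level 1 of the new index.
dates = pd.DatetimeIndex(np.concatenate([r.to_numpy() for r in ranges]))
out.index = pd.MultiIndex.from_arrays(
    [out["TF-Objektnummer"].to_numpy(), dates], names=["TF-objNr", "Date"]
)

# Calendar columns derived from the level-1 dates, as in the answer.
out["Year"] = dates.year
out["Month"] = dates.month
out["Day"] = dates.day
out["DoW"] = dates.dayofweek
out["Hour"] = dates.round("h").hour
out["Season"] = (out["Month"] % 12 + 3) // 3 - 1
```

The Season expression shifts December to 0 with `% 12`, so the integer division groups the months as Dec-Feb, Mar-May, Jun-Aug and Sep-Nov, which corresponds to the 0 to 3 coding in the question's if/elif chain.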
