高效地使用Pandas进行多重索引透视

3 投票
1 回答
653 浏览
提问于 2025-04-18 06:38

我正在使用大约300 MB的金融数据,这些数据是拍卖中的限价单。它是多维数据,长得像这样:

                                bid                                                                                                                                                                                                                                                                                                                                                                                       ask                                                                                                                                                                                                                                                                                                                                                                           
                                  0                  1                  2                  3                  4                  5                  6                  7                  8                  9                 10                 11                 12                 13                 14                 15                 16                 17                 18               19                  0                  1                  2                  3                  4                  5                  6                  7                  8                  9                 10                 11                 12                 13                 14                 15               16               17               18               19          
                              price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity  price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity    price  quantity  price  quantity  price  quantity  price  quantity  price  quantity
2014-05-13 08:47:16.180000  102.298   1000000  102.297   1500000  102.296   6500000  102.295   8000000  102.294   3000000  102.293  24300000  102.292   6000000  102.291   1000000  102.290   1000000  102.289   2500000  102.288  11000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   6100000  102.304  14700000  102.305   3500000  102.307   9800000  102.308  15500000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.003000  102.298   1000000  102.297   2500000  102.296   6500000  102.295   7000000  102.294   3000000  102.293  24300000  102.292   6000000  102.291   1000000  102.290   1000000  102.289   2500000  102.288  11000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   5100000  102.304  14700000  102.305   4500000  102.307   9800000  102.308  15500000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.005000  102.298   3000000  102.297   3500000  102.296   6000000  102.295   9300000  102.294   4000000  102.293  17500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000      NaN       NaN    NaN       NaN  102.302   2000000  102.303   5100000  102.304  14700000  102.305   4500000  102.307   9000000  102.308  16300000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000      NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.006000  102.299   1000000  102.298   3000000  102.297   6500000  102.296   5000000  102.295   5300000  102.294   4000000  102.293  15500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000    NaN       NaN  102.302   2000000  102.303   5100000  102.304  11700000  102.305   7500000  102.307   9000000  102.308  11300000  102.309   5000000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN
2014-05-13 08:47:17.007000  102.299   1000000  102.298   3000000  102.297   8500000  102.296   4000000  102.295   4300000  102.294   5000000  102.293  14500000  102.292   2000000  102.291   4000000  102.290   1000000  102.289   2500000  102.288   6000000  102.287   4000000  102.286  10100000  102.284   5000000  102.280   1500000  102.276   3000000  102.275   8100000  102.265   9500000    NaN       NaN  102.302   2000000  102.303   4100000  102.304  13700000  102.305   7500000  102.307   8000000  102.308  12300000  102.309   5000000  102.310   5000000  102.312   7000000  102.313   1000000  102.315   8000000  102.316   4500000  102.320   4000000  102.321   1000000  102.324   4000000  102.325   9500000    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN

(注意,当你到达20时,第一层的变化。抱歉表格格式有点长……)

我需要对这些数据进行一些透视操作。例如,数据中不是用0、1、2、3这样的数字(表示订单在队列中的相对位置),而是用102.297、102.296等,也就是订单的价格作为索引。这里有一个这样的操作示例:

x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])

结果是:

                                         quantity
                           side price            
2014-05-13 08:47:17.003000 ask  102.300         0
                                102.301         0
                                102.302         0
                                102.303  -1000000
                                102.304         0

这个操作可以通过结合使用stack/unstack/reset_index来实现,但看起来效率真的很低。我还没查看代码,但我猜每次使用stackunstack时,都会复制一份表格,这导致我8GB的系统内存不够用,开始使用页面文件。我觉得在这种情况下也不能使用pivot,因为所需的列在多重索引中。

有没有什么建议可以让我加快这个过程?

这里有一个示例输入的csv文件,按照评论的要求:

side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A

1 个回答

1

Unstack的意思是把数据的行和列重新排列,这样一来,如果你的数据有很多列和行,就会占用很大的内存空间。

这里有一个解决方案,虽然速度比较慢,但我觉得它的内存使用峰值会低很多。这样做可能会让总的空间稍微小一点,因为原始数据中有些零值在这里可能不会出现(不过你可以重新索引并填充来解决这个问题)。

定义这个函数,可能可以针对这种情况进行优化(因为已经在按级别分组了)。

In [79]: def f(x):                                                              
    try:
        y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
        return y[y!=0].dropna()
    except:
        return None
   ....:     

对列进行按'级别'分组,然后应用函数f;不要直接使用apply,而是把结果作为行连接起来(这就是'反堆叠'的部分)。

不过这样会产生重复数据(在价格级别上),所以需要对它们进行汇总。

In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]: 
value                                 quantity
                        side price            
2014-05-13 08:47:17.003 ask  102.303  -1000000
                             102.305   1000000
                        bid  102.295  -1000000
                             102.297   1000000
2014-05-13 08:47:17.005 ask  102.307   -800000
                             102.308    800000
                        bid  102.288  -5000000
                             102.291   3000000
                             102.292  -4000000
                             102.293  -6800000
                             102.294   1000000
                             102.295   2300000
                             102.296   -500000
                             102.297   1000000
                             102.298   2000000
2014-05-13 08:47:17.006 ask  102.304  -3000000
                             102.305   3000000
                             102.308  -5000000
                             102.309   5000000
                             102.310         0
                             102.312         0
                             102.313         0
                             102.315         0
                             102.316         0
                             102.320         0
                             102.321         0
                             102.324         0
                             102.325         0
                        bid  102.265  -9500000
                             102.275         0
                             102.276         0
                             102.280         0
                             102.284         0
                             102.286         0
                             102.287         0
                             102.288         0
                             102.289         0
                             102.290         0
                             102.291         0
                             102.292         0
                             102.293  -2000000
                             102.294         0
                             102.295  -4000000
                             102.296  -1000000
                             102.297   3000000
                             102.298         0
                             102.299   1000000
2014-05-13 08:47:17.007 ask  102.303  -1000000
                             102.304   2000000
                             102.307  -1000000
                             102.308   1000000
                        bid  102.293  -1000000
                             102.294   1000000
                             102.295  -1000000
                             102.296  -1000000
                             102.297   2000000

时间记录(我觉得优化函数f会让这个过程快很多)。

In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop

In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop

原始方法

In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop

In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop

撰写回答