高效地使用Pandas进行多重索引透视
我正在使用大约300 MB的金融数据,这些数据是拍卖中的限价单。它是多维数据,长得像这样:
bid ask
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity
2014-05-13 08:47:16.180000 102.298 1000000 102.297 1500000 102.296 6500000 102.295 8000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 6100000 102.304 14700000 102.305 3500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.003000 102.298 1000000 102.297 2500000 102.296 6500000 102.295 7000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.005000 102.298 3000000 102.297 3500000 102.296 6000000 102.295 9300000 102.294 4000000 102.293 17500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9000000 102.308 16300000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.006000 102.299 1000000 102.298 3000000 102.297 6500000 102.296 5000000 102.295 5300000 102.294 4000000 102.293 15500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 5100000 102.304 11700000 102.305 7500000 102.307 9000000 102.308 11300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.007000 102.299 1000000 102.298 3000000 102.297 8500000 102.296 4000000 102.295 4300000 102.294 5000000 102.293 14500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 4100000 102.304 13700000 102.305 7500000 102.307 8000000 102.308 12300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
(注意,当你到达20时,第一层的变化。抱歉表格格式有点长……)
我需要对这些数据进行一些透视操作。例如,数据中不是用0、1、2、3这样的数字(表示订单在队列中的相对位置),而是用102.297、102.296等,也就是订单的价格作为索引。这里有一个这样的操作示例:
x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
结果是:
quantity
side price
2014-05-13 08:47:17.003000 ask 102.300 0
102.301 0
102.302 0
102.303 -1000000
102.304 0
这个操作可以通过结合使用stack/unstack/reset_index
来实现,但看起来效率真的很低。我还没查看代码,但我猜每次使用stack
或unstack
时,都会复制一份表格,这导致我8GB的系统内存不够用,开始使用页面文件。我觉得在这种情况下也不能使用pivot
,因为所需的列在多重索引中。
有没有什么建议可以让我加快这个过程?
这里有一个示例输入的csv文件,按照评论的要求:
side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
1 个回答
1
Unstack的意思是把数据的行和列重新排列,这样一来,如果你的数据有很多列和行,就会占用很大的内存空间。
这里有一个解决方案,虽然速度比较慢,但我觉得它的内存使用峰值会低很多。这样做可能会让总的空间稍微小一点,因为原始数据中有些零值在这里可能不会出现(不过你可以重新索引并填充来解决这个问题)。
定义这个函数,可能可以针对这种情况进行优化(因为已经在按级别分组了)。
In [79]: def f(x):
try:
y = x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
return y[y!=0].dropna()
except:
return None
....:
对列进行按'级别'分组,然后应用函数f;不要直接使用apply,而是把结果作为行连接起来(这就是'反堆叠'的部分)。
不过这样会产生重复数据(在价格级别上),所以需要对它们进行汇总。
In [76]: concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
Out[76]:
value quantity
side price
2014-05-13 08:47:17.003 ask 102.303 -1000000
102.305 1000000
bid 102.295 -1000000
102.297 1000000
2014-05-13 08:47:17.005 ask 102.307 -800000
102.308 800000
bid 102.288 -5000000
102.291 3000000
102.292 -4000000
102.293 -6800000
102.294 1000000
102.295 2300000
102.296 -500000
102.297 1000000
102.298 2000000
2014-05-13 08:47:17.006 ask 102.304 -3000000
102.305 3000000
102.308 -5000000
102.309 5000000
102.310 0
102.312 0
102.313 0
102.315 0
102.316 0
102.320 0
102.321 0
102.324 0
102.325 0
bid 102.265 -9500000
102.275 0
102.276 0
102.280 0
102.284 0
102.286 0
102.287 0
102.288 0
102.289 0
102.290 0
102.291 0
102.292 0
102.293 -2000000
102.294 0
102.295 -4000000
102.296 -1000000
102.297 3000000
102.298 0
102.299 1000000
2014-05-13 08:47:17.007 ask 102.303 -1000000
102.304 2000000
102.307 -1000000
102.308 1000000
bid 102.293 -1000000
102.294 1000000
102.295 -1000000
102.296 -1000000
102.297 2000000
时间记录(我觉得优化函数f会让这个过程快很多)。
In [77]: %timeit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
1 loops, best of 3: 319 ms per loop
In [78]: %memit concat([ f(grp) for g, grp in df.groupby(level='level',axis=1) ]).groupby(level=[0,1,2]).sum().sortlevel()
maximum of 1: 67.515625 MB per loop
原始方法
In [7]: %timeit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
10 loops, best of 3: 56.4 ms per loop
In [8]: %memit df.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
maximum of 1: 61.187500 MB per loop