pandas groupby is getting slower
I must be doing something wrong, because my Python script keeps getting slower. In this script I have a query column and a score column, and I only want the highest score for each common query value. I used pandas' groupby function for this, then I keep only the queries whose score is at least 90% of that maximum.
from datetime import datetime
import pandas as pd

startTime = datetime.now()
# Column names match the dtypes printed below (queryid, subjectid, bitscore).
data = pd.read_csv(inputfile, names=['queryid', 'subjectid', 'bitscore'], sep='\t')
print "INPUT INFORMATION"
print "Blast inputfile has:", "{:,}".format(data.shape[0]), "records"
print data.dtypes
print "Time test 1 :", str(datetime.now()-startTime)
data['max'] = data.groupby('queryid')['bitscore'].transform(lambda x: x.max())
print "Time test 2", str(datetime.now()-startTime)
data = data[data['bitscore']>=0.9*data['max']]
print "Time test 3", str(datetime.now()-startTime)
Here is the output:
INPUT INFORMATION
Blast inputfile has: 1,367,808 records
queryid object
subjectid object
bitscore float64
dtype: object
Time test 1 : 0:00:05.075944
Time test 2 0:30:40.750674
Time test 3 0:30:41.317064
That is a lot of records, but still... the machine has over 100 GB of RAM. When I ran this script yesterday it took 26 minutes to reach "test 2". Now it is past 30 minutes. Do you think I should uninstall Python and reinstall it? Has anyone seen this happen before?
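For reference, the filtering described in the question can be sketched on a toy frame (column names `queryid`/`bitscore` taken from the printed dtypes; modern Python 3 / pandas syntax assumed, toy values are made up):

```python
import pandas as pd

# Toy data mimicking the question's layout: several scores per query.
data = pd.DataFrame({
    'queryid':  ['q1', 'q1', 'q1', 'q2', 'q2'],
    'bitscore': [100.0, 95.0, 50.0, 200.0, 170.0],
})

# Broadcast the per-group max back onto every row...
data['max'] = data.groupby('queryid')['bitscore'].transform('max')
# ...then keep only rows within 90% of their group's max.
kept = data[data['bitscore'] >= 0.9 * data['max']]
print(kept)
```

Here `q1`'s rows 100.0 and 95.0 survive (threshold 90.0) and `q2` keeps only 200.0 (threshold 180.0).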
1 Answer
For completeness, this is using version 0.14.1
In [13]: pd.set_option('max_rows',10)
In [9]: N = 1400000
In [10]: ngroups = 1000
In [11]: groups = [ "A%04d" % i for i in xrange(ngroups) ]
In [12]: df = DataFrame(dict(A = np.random.choice(groups,size=N,replace=True), B = np.random.randn(N)))
In [14]: df
Out[14]:
A B
0 A0722 0.621374
1 A0390 -0.843030
2 A0897 -1.633165
3 A0546 0.483448
4 A0366 1.866380
... ... ...
1399995 A0515 -1.051668
1399996 A0591 -1.216455
1399997 A0766 -0.914020
1399998 A0635 0.258893
1399999 A0577 1.874328
[1400000 rows x 2 columns]
In [15]: df.groupby('A')['B'].transform('max')
Out[15]:
0 3.688245
1 3.829529
2 3.717359
...
1399997 4.213080
1399998 3.121092
1399999 2.990630
Name: B, Length: 1400000, dtype: float64
In [16]: %timeit df.groupby('A')['B'].transform('max')
1 loops, best of 3: 437 ms per loop
In [17]: ngroups = 10000
In [18]: groups = [ "A%04d" % i for i in xrange(ngroups) ]
In [19]: df = DataFrame(dict(A = np.random.choice(groups,size=N,replace=True), B = np.random.randn(N)))
In [20]: %timeit df.groupby('A')['B'].transform('max')
1 loops, best of 3: 1.43 s per loop
In [23]: ngroups = 100000
In [24]: groups = [ "A%05d" % i for i in xrange(ngroups) ]
In [25]: df = DataFrame(dict(A = np.random.choice(groups,size=N,replace=True), B = np.random.randn(N)))
In [27]: %timeit df.groupby('A')['B'].transform('max')
1 loops, best of 3: 10.3 s per loop
So this transform is roughly O(number of groups).
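Beyond the group count, the question's code also pays for a Python-level lambda: passing the string `'max'` dispatches to a cythonized built-in, while `transform(lambda x: x.max())` calls back into Python once per group. A small sketch of the comparison (sizes scaled down from the answer's benchmark; timings are machine-dependent and not from the original post):

```python
import time
import numpy as np
import pandas as pd

# Smaller than the answer's 1.4M rows, same shape of problem.
N, ngroups = 200_000, 1000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': rng.choice([f'A{i:04d}' for i in range(ngroups)], size=N),
    'B': rng.standard_normal(N),
})

# Fast path: the string 'max' is dispatched to a cythonized aggregation.
t0 = time.perf_counter()
fast = df.groupby('A')['B'].transform('max')
t_fast = time.perf_counter() - t0

# Slow path: the lambda is invoked in Python once per group.
t0 = time.perf_counter()
slow = df.groupby('A')['B'].transform(lambda x: x.max())
t_slow = time.perf_counter() - t0

# Both produce identical results; only the dispatch mechanism differs.
assert fast.equals(slow)
print(f"'max': {t_fast:.3f}s   lambda: {t_slow:.3f}s")
```

With very many groups (as in the question's BLAST data), the per-group Python overhead of the lambda compounds the O(number of groups) scaling shown above.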