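Search performance: Python pandas vs. NumPy arrays
I'm trying to search for a string in a pandas column. I've read that the fastest approach is to sort the column first and then look the string up with searchsorted. However, I found this to be slower than a brute-force search (array == string) on a NumPy array holding the same kind of data. To find out why, I ran the following test: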
import timeit
setup4 = '''
import numpy as np, string, random
d = np.array([
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
for _ in range(20000)
],dtype=object)
'''
setup5 = '''
import numpy as np, pandas as pd, string, random
header = [
u'A',
u'B',
u'C',
u'D',
u'E',
u'F',
u'G',
u'H',
u'I',
u'J',
u'K',
u'L',
u'M',
u'N'
]
data = [[pd.to_datetime('20140505'),
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
u'sfgweorfjdfl',
u'dsiofqjwel;dmfv',
u'e3ruiwefjvgoiubg',
u'3124oirjrg;klhbas',
u';3rhfgfbnvsad3r',
pd.to_datetime('20140505'),
u'1234irtjurgbfas',
u'12;rhfd;hb;oasere',
u'124urgfdnv.,sadfg',
u'1rfnhsdjk.dhafgsrdew',
u'safeklrjh2nerfgsd.'
] for _ in range(20000)]
df = pd.DataFrame(data,columns=header)
df_sorted = df.sort_values(['B','C'])
e = df_sorted['B'].values
'''
setup6 = '''
import numpy as np, pandas as pd, string, random
header = [
u'A',
u'B',
u'C',
u'D',
u'E',
u'F',
u'G',
u'H',
u'I',
u'J',
u'K',
u'L',
u'M',
u'N'
]
data = [[pd.to_datetime('20140505'),
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
u'sfgweorfjdfl',
u'dsiofqjwel;dmfv',
u'e3ruiwefjvgoiubg',
u'3124oirjrg;klhbas',
u';3rhfgfbnvsad3r',
pd.to_datetime('20140505'),
u'1234irtjurgbfas',
u'12;rhfd;hb;oasere',
u'124urgfdnv.,sadfg',
u'1rfnhsdjk.dhafgsrdew',
u'safeklrjh2nerfgsd.'
] for _ in range(20000)]
df = pd.DataFrame(data,columns=header)
f = df['B'].values
'''
print(timeit.timeit("index = d == u'ASDASD123ASADKHX'", setup=setup4,number=1000))
print(timeit.timeit("index = e == u'ASDASD123ASADKHX'", setup=setup5,number=1000))
print(timeit.timeit("index = f == u'ASDASD123ASADKHX'", setup=setup6,number=1000))
This gave the following results:
print(timeit.timeit("index = d == u'ASDASD123ASADKHX'", setup=setup4,number=1000))
0.808505267014
print(timeit.timeit("index = e == u'ASDASD123ASADKHX'", setup=setup5,number=1000))
3.06733738226
print(timeit.timeit("index = f == u'ASDASD123ASADKHX'", setup=setup6,number=1000))
1.64207848896
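My question is: why does the plain NumPy array perform so much better, and how can I get the same performance with data extracted from a pandas table?
Thanks a lot.
2 Answers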
0
I tested your code in IPython, and apart from the unsorted DataFrame, all versions performed essentially the same.
In [85]:
import numpy as np, pandas as pd, string, random
d = np.array([
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
for _ in range(20000)
],dtype=object)
header = [
u'A',
u'B',
u'C',
u'D',
u'E',
u'F',
u'G',
u'H',
u'I',
u'J',
u'K',
u'L',
u'M',
u'N'
]
data = [[pd.to_datetime('20140505'),
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
u'sfgweorfjdfl',
u'dsiofqjwel;dmfv',
u'e3ruiwefjvgoiubg',
u'3124oirjrg;klhbas',
u';3rhfgfbnvsad3r',
pd.to_datetime('20140505'),
u'1234irtjurgbfas',
u'12;rhfd;hb;oasere',
u'124urgfdnv.,sadfg',
u'1rfnhsdjk.dhafgsrdew',
u'safeklrjh2nerfgsd.'
] for _ in range(20000)]
df = pd.DataFrame(data,columns=header)
df_sorted = df.sort_values(['B','C'])
e = df_sorted['B'].values
f = df['B'].values
%timeit index = d == u'ASDASD123ASADKHX'
%timeit index = e == u'ASDASD123ASADKHX'
%timeit index = f == u'ASDASD123ASADKHX'
1000 loops, best of 3: 536 µs per loop
1000 loops, best of 3: 568 µs per loop
1000 loops, best of 3: 538 µs per loop
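One way to see why the three timings can come out so close is to check what each array actually holds: if all three are object arrays of Python strings, the == comparison does the same per-element work on each. The following is only a minimal sketch under assumptions: it uses a smaller two-column frame instead of the 14-column one, the helper random_key is made up for illustration, and to_numpy(dtype=object) is used so that the result is an object array regardless of the pandas version.

```python
import random
import string

import numpy as np
import pandas as pd


def random_key(length=16):
    """Return a random uppercase/digit string (hypothetical helper)."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


n = 2000  # smaller than the 20000 rows in the question, to keep the sketch quick

# Plain NumPy object array, as in setup4.
d = np.array([random_key() for _ in range(n)], dtype=object)

# Reduced stand-in for the question's DataFrame (only columns B and C).
df = pd.DataFrame({
    "B": [random_key() for _ in range(n)],
    "C": [random_key() for _ in range(n)],
})

# Sorted and unsorted column values, forced to object dtype.
e = df.sort_values(["B", "C"])["B"].to_numpy(dtype=object)
f = df["B"].to_numpy(dtype=object)

# All three arrays report the same dtype and shape.
for name, arr in (("d", d), ("e", e), ("f", f)):
    print(name, arr.dtype, arr.shape)
```

If the dtypes match like this, any remaining timing gap is more likely to come from the environment (interpreter, library versions, memory layout of the string objects) than from pandas itself; that is an interpretation, not something the timings above prove.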
0
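In a DataFrame, every string is a Python object, so df['B'].values returns an object array. When you create a string array with np.array(), however, it returns a fixed-width array in which every string occupies the same number of bytes.
Here is an example: a is an array with dtype S10 (on Python 3 the same code gives <U10), and b is an array with dtype object.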
import numpy as np
import random
import string
words = ["".join(random.choice(string.ascii_uppercase) for _ in range(10)) for _ in range(10000)]
a = np.array(words)
b = a.astype("O")
%timeit a == "123"
%timeit b == "123"
Output:
10000 loops, best of 3: 122 µs per loop
10000 loops, best of 3: 164 µs per loop
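Building on the dtype difference above, one possible way to apply it to a pandas column is to convert the column once to a fixed-width NumPy string array and, if many lookups are needed, sort it and use np.searchsorted. This is a sketch under assumptions: the column name B comes from the question, the tiny sample data and the helper find_sorted are hypothetical, and whether this is actually faster depends on the NumPy and pandas versions and on the data size, so it should be timed on real data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the question's column B (hypothetical data).
df = pd.DataFrame({"B": ["K3", "A1", "Z9", "A1", "M5"]})

# Convert once to a fixed-width unicode array instead of an object array.
keys = np.array(df["B"].tolist())

# Brute-force search: one vectorised comparison over the whole array.
mask = keys == "A1"

# Sorted search: sort once, then each lookup is a binary search.
order = np.argsort(keys, kind="stable")  # positions in the original column
sorted_keys = keys[order]


def find_sorted(sorted_arr, order, target):
    """Return original row positions equal to target (hypothetical helper)."""
    lo = np.searchsorted(sorted_arr, target, side="left")
    hi = np.searchsorted(sorted_arr, target, side="right")
    return np.sort(order[lo:hi])


print(np.flatnonzero(mask))                    # positions found by brute force
print(find_sorted(sorted_keys, order, "A1"))   # positions found via searchsorted
```

The sort costs more than a single comparison pass, so the sorted variant only has a chance of paying off when the same column is searched many times.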