Python: search performance of pandas vs. numpy arrays

Published 2024-04-19 03:04:53


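I am trying to search for a string in a pandas column. I have read that sorting the column first and then using searchsorted on the values should be the fastest way to find a string. I found that this is slower than a brute-force search (array == string) on the same numpy array. To understand why, I ran the following test: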

import timeit

setup4 = '''  
import numpy as np, string, random 

d =     np.array([
            u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
             for _ in range(20000)
             ],dtype=object)  # np.object was removed in NumPy 1.24; the builtin object is equivalent
'''



setup5 = '''  
import numpy as np, pandas as pd, string, random 

header = [
                    u'A',
                    u'B',
                    u'C',
                    u'D',
                    u'E',
                    u'F',
                    u'G',
                    u'H',
                    u'I',
                    u'J',
                    u'K',
                    u'L',
                    u'M',
                    u'N'
                    ]


data =     [[pd.to_datetime('20140505'),
                u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
                u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
                u'sfgweorfjdfl',
                u'dsiofqjwel;dmfv',
                u'e3ruiwefjvgoiubg',
                u'3124oirjrg;klhbas',
                u';3rhfgfbnvsad3r',
                pd.to_datetime('20140505'),
                u'1234irtjurgbfas',
                u'12;rhfd;hb;oasere',
                u'124urgfdnv.,sadfg',
                u'1rfnhsdjk.dhafgsrdew',
                u'safeklrjh2nerfgsd.'
                ] for _ in range(20000)]

df = pd.DataFrame(data,columns=header)
df_sorted = df.sort_values(['B','C'])  # DataFrame.sort was removed from pandas; sort_values replaces it
e = df_sorted['B'].values
'''

setup6 = '''  
import numpy as np, pandas as pd, string, random 

header = [
                    u'A',
                    u'B',
                    u'C',
                    u'D',
                    u'E',
                    u'F',
                    u'G',
                    u'H',
                    u'I',
                    u'J',
                    u'K',
                    u'L',
                    u'M',
                    u'N'
                    ]


data =     [[pd.to_datetime('20140505'),
                u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
                u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
                u'sfgweorfjdfl',
                u'dsiofqjwel;dmfv',
                u'e3ruiwefjvgoiubg',
                u'3124oirjrg;klhbas',
                u';3rhfgfbnvsad3r',
                pd.to_datetime('20140505'),
                u'1234irtjurgbfas',
                u'12;rhfd;hb;oasere',
                u'124urgfdnv.,sadfg',
                u'1rfnhsdjk.dhafgsrdew',
                u'safeklrjh2nerfgsd.'
                ] for _ in range(20000)]

df = pd.DataFrame(data,columns=header)
f = df['B'].values
'''

print(timeit.timeit("index = d == u'ASDASD123ASADKHX'", setup=setup4,number=1000))
print(timeit.timeit("index = e == u'ASDASD123ASADKHX'", setup=setup5,number=1000))
print(timeit.timeit("index = f == u'ASDASD123ASADKHX'", setup=setup6,number=1000))

The results are as follows:

[timing output not preserved in the source]

My question is: why does the pure numpy array perform so much better? And how can I get the same performance with data extracted from a pandas table?

Thank you very much.


2 answers

I tested your code in IPython. Apart from the unsorted DataFrame, all variants performed almost identically:

In [85]:

import numpy as np, pandas as pd, string, random 

d =     np.array([
            u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
             for _ in range(20000)
             ],dtype=object)  # np.object was removed in NumPy 1.24; the builtin object is equivalent

header = [
                    u'A',
                    u'B',
                    u'C',
                    u'D',
                    u'E',
                    u'F',
                    u'G',
                    u'H',
                    u'I',
                    u'J',
                    u'K',
                    u'L',
                    u'M',
                    u'N'
                    ]


data =     [[pd.to_datetime('20140505'),
                u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
                u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
                u'sfgweorfjdfl',
                u'dsiofqjwel;dmfv',
                u'e3ruiwefjvgoiubg',
                u'3124oirjrg;klhbas',
                u';3rhfgfbnvsad3r',
                pd.to_datetime('20140505'),
                u'1234irtjurgbfas',
                u'12;rhfd;hb;oasere',
                u'124urgfdnv.,sadfg',
                u'1rfnhsdjk.dhafgsrdew',
                u'safeklrjh2nerfgsd.'
                ] for _ in range(20000)]

df = pd.DataFrame(data,columns=header)
df_sorted = df.sort_values(['B','C'])  # DataFrame.sort was removed from pandas; sort_values replaces it
e = df_sorted['B'].values
f = df['B'].values
%timeit index = d == u'ASDASD123ASADKHX'
%timeit index = e == u'ASDASD123ASADKHX'
%timeit index = f == u'ASDASD123ASADKHX'
1000 loops, best of 3: 536 µs per loop
1000 loops, best of 3: 568 µs per loop
1000 loops, best of 3: 538 µs per loop

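Every string in a DataFrame is a Python object, so df['B'].values gives you an object array. When you build a string array with np.array() instead, you get an array in which every string occupies the same number of bytes.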

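Here is an example, where a is an array with a fixed-width string dtype (S10 in the original answer; the same code yields U10 under Python 3) and b is an array with object dtype.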

import numpy as np
import random
import string
words = ["".join(random.choice(string.ascii_uppercase) for _ in range(10)) for _ in range(10000)]
a = np.array(words)
b = a.astype("O")
%timeit a == "123"
%timeit b == "123"

Output:

[timing output not preserved in the source]
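To get the fixed-width behaviour for data that already lives in a DataFrame, one option is to convert the object column to a fixed-width unicode array before searching. The following is only a sketch under assumptions: the single-column frame, the variable names (obj_values, fixed_values, target) and the choice of astype(str) are illustrative and not taken from either post, and the printed timings will vary by machine and library version. It also shows the searchsorted lookup on a sorted copy that the question mentions.

```python
import random
import string
import timeit

import numpy as np
import pandas as pd

random.seed(0)  # reproducible sample data (illustrative only)
words = ["".join(random.choice(string.ascii_uppercase + string.digits)
                 for _ in range(16))
         for _ in range(20000)]
df = pd.DataFrame({"B": words})

# Column as an array of Python str objects (dtype=object).
obj_values = df["B"].to_numpy(dtype=object)

# Same data as fixed-width unicode: every element uses the same number of bytes.
fixed_values = obj_values.astype(str)

target = words[123]  # a value known to be present

# Brute-force comparison on both representations.
mask_obj = obj_values == target
mask_fixed = fixed_values == target

# Binary search on a sorted copy of the fixed-width array.
sorted_values = np.sort(fixed_values)
pos = np.searchsorted(sorted_values, target)
found = pos < len(sorted_values) and sorted_values[pos] == target

# Rough timings; no particular ratio is implied here.
t_obj = timeit.timeit(lambda: obj_values == target, number=100)
t_fixed = timeit.timeit(lambda: fixed_values == target, number=100)
t_search = timeit.timeit(lambda: np.searchsorted(sorted_values, target), number=100)
print(obj_values.dtype, fixed_values.dtype)
print(t_obj, t_fixed, t_search)
```

The conversion itself copies the column, so it probably only pays off when the same column is searched many times.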
