Why is setting values on a pandas Series so slow?

Posted 2024-04-26 00:34:44


Question

Does anyone know why setting a single item directly on a pandas Series is so slow? Am I doing something wrong, or is that just how it is?

I ran a few tests to see which way of setting values on a pandas Series object is fastest. Here are the results, from fastest to slowest:

Initialize an array, set values by integer index, then create the Series

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

1000 loops, best of 3: 630 µs per loop

Create an empty list, add items with append, then create the Series

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)

1000 loops, best of 3: 1.05 ms per loop

Initialize an array, create the Series, then set values with set_value

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)

100 loops, best of 3: 18.5 ms per loop

Initialize an array, create the Series, then set values by integer index

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0

10 loops, best of 3: 30.2 ms per loop

Initialize an array, create the Series, then set values with iat

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0

10 loops, best of 3: 36.2 ms per loop

Initialize an array, create the Series, then set values with iloc

%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0

1 loop, best of 3: 280 ms per loop


3 answers

From the docs:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for.

So I get the following timings, which should be comparable:

In [13]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0
10 loops, best of 3: 23.3 ms per loop
In [14]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0
10 loops, best of 3: 159 ms per loop

For the other tests:

In [15]:

%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)
1000 loops, best of 3: 525 µs per loop
In [16]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i,1.0)
100 loops, best of 3: 10.1 ms per loop
In [17]:

%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0
100 loops, best of 3: 17.5 ms per loop
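
To rerun this kind of comparison outside IPython, a minimal sketch with the standard-library `timeit` module could look like the following. The function names are hypothetical, and `.at` stands in for `set_value`, which as far as I know was deprecated and later removed from pandas. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000  # same length as in the tests above


def fill_array_then_wrap():
    # Fill the NumPy array first, wrap it in a Series afterwards
    a = np.empty(N, dtype="float")
    for i in range(N):
        a[i] = 1.0
    return pd.Series(a)


def fill_with_iat():
    s = pd.Series(np.empty(N, dtype="float"))
    for i in range(N):
        s.iat[i] = 1.0
    return s


def fill_with_at():
    # Assumption: .at is used here as the label-based scalar setter
    # in place of the older set_value
    s = pd.Series(np.empty(N, dtype="float"))
    for i in range(N):
        s.at[i] = 1.0
    return s


def fill_with_iloc():
    s = pd.Series(np.empty(N, dtype="float"))
    for i in range(N):
        s.iloc[i] = 1.0
    return s


candidates = [fill_array_then_wrap, fill_with_iat, fill_with_at, fill_with_iloc]

# Best of 3 repeats, 5 calls per repeat, reported as seconds per call
timings = {
    f.__name__: min(timeit.repeat(f, number=5, repeat=3)) / 5
    for f in candidates
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>22}: {seconds * 1e6:10.1f} µs per loop")
```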

I worked out how to avoid the indexing overhead when setting values directly on a Series object:

a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    a[i] = 1.0

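When a Series is initialized from a NumPy array, the data is not copied (at least in the pandas version used here). If you keep a reference to the original array, you can set values on it directly!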
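
Whether the Series really shares its buffer with the array can depend on the pandas version and its copy settings, so it seems safer to check than to assume. A minimal sketch, assuming `copy=False` is passed explicitly and using `np.shares_memory` as the check (the fallback branch is my addition, not part of the answer above):

```python
import numpy as np
import pandas as pd

a = np.empty(1000, dtype="float")
s = pd.Series(a, copy=False)  # explicitly ask pandas not to copy the array

# True only if the Series is backed by the same memory as `a`
shares = np.shares_memory(a, s.to_numpy())

# Write through the plain NumPy array, avoiding the Series indexers
for i in range(len(a)):
    a[i] = 1.0

if not shares:
    # pandas made its own copy, so rebuild the Series from the filled array
    s = pd.Series(a)

print("shares memory:", shares)
```

Either way the loop itself only touches the NumPy array, which is where the speedup in the timings above comes from.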

I think these approaches are faster for initializing a Series to a constant value:

Baseline

%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)

10000 loops, best of 3: 121 µs per loop

Alternatives

%%timeit
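# Caveat: np.empty leaves the buffer uninitialized, so multiplying by 1. does not yield a constant Series
s = pd.Series(np.empty(1000, dtype='float')) * 1.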

10000 loops, best of 3: 99.5 µs per loop

%%timeit
constant = 5.
s = pd.Series(np.ones(1000)) * constant

10000 loops, best of 3: 85.3 µs per loop
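
Along the same lines, the constant can be written into the array at creation time rather than by multiplying afterwards. A short sketch, assuming `np.full` and scalar broadcasting in the `pd.Series` constructor as the two variants (neither was timed in this thread, so no speed claim is made here):

```python
import numpy as np
import pandas as pd

constant = 5.0
n = 1000

# Variant 1: let NumPy create an array already filled with the constant
s_full = pd.Series(np.full(n, constant))

# Variant 2: pass a scalar plus an index and let pandas broadcast it
s_scalar = pd.Series(constant, index=range(n))
```

Both avoid a Python-level loop entirely, which is the main cost in the per-item tests above.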
