pandas - very, very slow

1 vote

1 answer

10589 views

Asked 2025-04-18 12:37

I am trying to use df.apply on date objects, but it is unbearably slow!!

My prun output is...

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1999   14.563    0.007   14.563    0.007 {pandas.tslib.array_to_timedelta64}
 13998    0.103    0.000   15.221    0.001 series.py:126(__init__)
  9999    0.093    0.000    0.093    0.000 {method 'reduce' of 'numpy.ufunc' objects}
272012    0.093    0.000    0.125    0.000 {isinstance}
  5997    0.089    0.000    0.196    0.000 common.py:199(_isnull_ndarraylike)

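Basically, the run takes 14 seconds for an array of length 2000. My actual array has more than 100,000 elements, which translates to a run time of more than 15 minutes, maybe more.

Is it really pandas calling "pandas.tslib.array_to_timedelta64" that is the bottleneck? I really don't understand why this function call is necessary. Both operands of the subtraction have the same data type: I explicitly converted them beforehand with pd.to_datetime(). And no, the time for that conversion is not included in this measurement.

So you can understand my frustration with this awful code!!!

The actual code looks like this: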

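import numpy as np
import pandas as pd

# bet_endtimes and currentdata are built earlier (not shown)
df = pd.DataFrame(bet_endtimes)

def testing():
    # For each row, find the position in currentdata['date'] closest to that row's timestamp
    close_indices = df.apply(lambda x: np.argmin(np.abs(currentdata['date'] - x[0])), axis=1)
    print(close_indices)

%prun testing()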


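1 Answer

I suggest you take a look at the docs: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-deltas. It would also help a lot if you provided some sample data, so I don't have to guess what you are doing.

Using apply should always be the last thing you try. Vectorized methods are much faster.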

In [55]: pd.set_option('max_rows',10)

In [56]: df = pd.DataFrame(dict(A = pd.date_range('20130101',periods=100000, freq='s')))

In [57]: df
Out[57]: 
                        A
0     2013-01-01 00:00:00
1     2013-01-01 00:00:01
2     2013-01-01 00:00:02
3     2013-01-01 00:00:03
4     2013-01-01 00:00:04
...                   ...
99995 2013-01-02 03:46:35
99996 2013-01-02 03:46:36
99997 2013-01-02 03:46:37
99998 2013-01-02 03:46:38
99999 2013-01-02 03:46:39

[100000 rows x 1 columns]

In [58]:  (df['A']-df.loc[10,'A']).abs()
Out[58]: 
0   00:00:10
1   00:00:09
2   00:00:08
...
99997   1 days, 03:46:27
99998   1 days, 03:46:28
99999   1 days, 03:46:29
Name: A, Length: 100000, dtype: timedelta64[ns]

In [59]: %timeit  (df['A']-df.loc[10,'A']).abs()
1000 loops, best of 3: 1.47 ms per loop

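When you contribute to pandas, you can name the methods.

From the question: "It is stupid of pandas to call this function "pandas.tslib.array_to_timedelta64", which is the bottleneck? And no, this time is not included in this calculation."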

Answered 2025-04-18 by Python大师
