Pandas optimization

Asked 2025-04-18 10:51 by 数据工程师

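I wrote a function that processes data with pandas. Below is the log from profiling it with %prun (only the first few lines). I want to optimize the code because I need to call this function more than 4,000 times, and a single run takes 37.7 seconds.

The most time-consuming part appears to be nonzero in numpy.ndarray. Since almost all of my operations are pandas-based, I would like to know which pandas functions rely heavily on this method.

My operations are mainly slicing DataFrames on a DatetimeIndex with df.ix[] and merging DataFrames with pandas.merge().

I know it is hard to judge without seeing my actual code, but the code is too long to be worth posting, and most of the operations are ad hoc, so I can't reduce it to a small example to post here.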

         16439731 function calls (16108083 primitive calls) in 37.766 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7461    3.712    0.000    3.712    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
      244    1.731    0.007    5.434    0.022 index.py:1126(_partial_date_slice)
      122    1.655    0.014    1.655    0.014 {pandas.algos.inner_join_indexer_int64}
      610    1.578    0.003    1.578    0.003 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
   118817    0.764    0.000    0.764    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    22474    0.753    0.000    0.917    0.000 index.py:409(is_unique)
   353210    0.669    0.000    1.228    0.000 {numpy.core.multiarray.array}
  1577935    0.596    0.000    0.925    0.000 {isinstance}
     1221    0.511    0.000    0.516    0.000 index.py:402(is_monotonic)
      183    0.427    0.002    0.427    0.002 {pandas.algos.left_outer_join}
    34529    0.376    0.000    1.286    0.000 index.py:98(__new__)
    12356    0.358    0.000    0.358    0.000 {method 'take' of 'numpy.ndarray' objects}
     3812    0.352    0.000    0.352    0.000 {pandas.algos.take_2d_axis0_int64_int64}
      610    0.344    0.001    0.349    0.001 index.py:35(wrapper)
      981    0.334    0.000    0.335    0.000 {method 'copy' of 'numpy.ndarray' objects}


1 Answer

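df.ix[] is ambiguous: it looks data up primarily by label but falls back to integer position. It was deprecated in pandas 0.20 and removed in pandas 1.0. I recommend .loc[] instead. Passing a single label returns the row for that label, and passing a range slices the data, with both endpoints of the range included. So instead of: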

df.ix[begin_date:end_date]

try:

df.loc[begin_date:end_date]
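To make the .loc form concrete, here is a minimal sketch. The DataFrame, its "value" column, and the two dates are made up for illustration; only the df.loc[begin_date:end_date] line comes from the answer. It assumes a sorted DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per day on a sorted DatetimeIndex.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"value": np.arange(10)}, index=idx)

begin_date = pd.Timestamp("2024-01-03")
end_date = pd.Timestamp("2024-01-06")

# Label-based slice: both begin_date and end_date are included.
window = df.loc[begin_date:end_date]
print(window)
```

Note that the slice covers four days here, because .loc includes the end label.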

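For more speed, use integer-position slicing with .iloc[]. Since you are iterating over the index anyway, you can wrap the loop in enumerate() and use the positions it yields. Unlike .loc, an .iloc slice excludes its end position. For example: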

df.iloc[4:9]
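The enumerate() idea might look like the sketch below. Since the original loop was not posted, the trailing-window sum, the window_size value, and the sample data are all assumptions chosen only to show positions feeding .iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical data: six daily rows with values 0..5.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"value": np.arange(6)}, index=idx)

window_size = 3  # assumed window length, for illustration only
sums = []

# enumerate() yields the integer position alongside each timestamp,
# so the slice can be taken by position instead of by label.
for pos, ts in enumerate(df.index):
    start = max(0, pos - window_size + 1)
    window = df.iloc[start:pos + 1]  # end position is exclusive
    sums.append(int(window["value"].sum()))

print(sums)
```

Because the positions come straight from the loop, no label lookup is needed inside it.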

On my machine, .iloc is about twice as fast as .loc.
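The ratio will vary with data size and pandas version, so it is worth measuring on your own data. The sketch below is one way to do that, under the assumption that the index is sorted: it converts the two date labels to positions once with Index.searchsorted, checks that both slices return the same rows, and then times each. The data and dates are invented for the example.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 one-minute rows on a sorted DatetimeIndex.
idx = pd.date_range("2020-01-01", periods=100_000, freq="min")
df = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

begin_date = pd.Timestamp("2020-01-10")
end_date = pd.Timestamp("2020-01-20")

# Translate labels to integer positions once (requires a sorted index).
# side="right" on the end label mirrors .loc's inclusive end point.
start = df.index.searchsorted(begin_date, side="left")
stop = df.index.searchsorted(end_date, side="right")

by_label = df.loc[begin_date:end_date]
by_position = df.iloc[start:stop]
same_rows = by_position.equals(by_label)

loc_time = timeit.timeit(lambda: df.loc[begin_date:end_date], number=1000)
iloc_time = timeit.timeit(lambda: df.iloc[start:stop], number=1000)
print(f"same rows: {same_rows}")
print(f".loc: {loc_time:.4f}s  .iloc: {iloc_time:.4f}s")
```

The position lookup only pays off if start and stop are computed once and reused; recomputing them on every call moves the label search back into the loop.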

Answered 2025-04-18 by Python大师
