Slow performance when using .head() with pandas in a loop

import pandas as pd
import numpy as np

# create dictionary of 10000 dataframes
numdfs = 10000
alldf = {i: pd.DataFrame({'a': np.random.randn(250),
                          'b': np.random.randn(250),
                          'c': np.random.randn(250),
                          'd': np.random.randn(250)})
         for i in range(numdfs)}

count = 250
runningsum = 0

for i in range(numdfs):
    df = alldf[i].head(count)
    df['is negative'] = (df['b'] < 0).cummax().astype(int)
    runningsum += df['is negative'].max()

1 Answer

User

#1 · Posted on 2024-04-20 11:55:27

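(df['b'] < 0).cummax().astype(int).max() only checks whether any value is below 0. You can use (df['b'] < 0).any() instead. The int conversion is not needed either, because True and False are treated as 1 and 0 respectively when summed.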
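As a quick check of that claim, the sketch below compares the two expressions on a few small Series. The sample data and the fixed seed are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for this illustration

samples = [
    pd.Series(rng.standard_normal(250)),  # random values, like column 'b'
    pd.Series([0.5, 1.0, 2.0]),           # no negative values
    pd.Series([1.0, -0.1, 3.0]),          # one negative value
]

for b in samples:
    original = (b < 0).cummax().astype(int).max()  # expression from the question
    simplified = (b < 0).any()                     # suggested replacement
    assert original == int(simplified)

# Booleans add up as 0/1, so no int conversion is needed before summing.
total = sum((b < 0).any() for b in samples)
print(total)
```

Both forms reduce a column to a single yes/no answer; any() simply skips building the intermediate cumulative-max Series.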

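As an aside, loc/iloc tend to be more efficient than other forms of slicing, but that is not the main cause of the poor performance here, whatever your tests suggest.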

For an equivalent algorithm, you can use a generator expression with sum:

sum((v.loc[:250, 'b'] < 0).any() for v in alldf.values())

Performance benchmarks followed here in the original answer, but the listing did not survive in this copy.
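One way to run such a comparison yourself is sketched below with the standard timeit module. The function names original_loop and any_sum are hypothetical, the frame count is reduced to 200 to keep the run short, the .copy() call is added only to avoid a SettingWithCopyWarning, and positional .iloc slicing stands in for .loc. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

numdfs = 200  # assumption: far fewer frames than the question's 10000
alldf = {i: pd.DataFrame({col: np.random.randn(250) for col in 'abcd'})
         for i in range(numdfs)}


def original_loop(frames, count=250):
    """Approach from the question: build a helper column, then take its max."""
    runningsum = 0
    for frame in frames.values():
        df = frame.head(count).copy()  # copy() is an addition of this sketch
        df['is negative'] = (df['b'] < 0).cummax().astype(int)
        runningsum += df['is negative'].max()
    return runningsum


def any_sum(frames, count=250):
    """Generator expression with any(), slicing the first `count` rows by position."""
    return sum((v['b'].iloc[:count] < 0).any() for v in frames.values())


# Both approaches must agree before their timings are worth comparing.
assert original_loop(alldf) == any_sum(alldf)

for fn in (original_loop, any_sum):
    seconds = timeit.timeit(lambda: fn(alldf), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```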

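This algorithm is still fairly slow, because it checks every single value in 'b' even though it could short-circuit as soon as a value below zero is found. With numba you can write a loop that does exactly that, which is roughly 12000x faster than the original algorithm: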

from numba import njit

@njit
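def any_below_zero(arr, k):
    # Scan at most the first k values and stop at the first negative one.
    # min() guards against reading past the end when k exceeds len(arr).
    for i in range(min(k, arr.shape[0])):
        if arr[i] < 0:
            return 1
    return 0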

def jpp_nb(alldf):
    return sum(any_below_zero(v['b'].values, 250) for v in alldf.values())

%timeit jpp_nb(alldf)    # 525 µs
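When timing a numba function outside of %timeit, keep in mind that the first call also pays for compilation. The sketch below separates the first call from a later one. The fallback decorator used when numba is not installed and the min() bound on the loop are assumptions of this sketch, not part of the answer's measurements.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption of this sketch: without numba, run the same loop uncompiled.
    def njit(func):
        return func


@njit
def any_below_zero(arr, k):
    # Stop at the first negative value among the first k entries.
    for i in range(min(k, arr.shape[0])):
        if arr[i] < 0:
            return 1
    return 0


arr = np.abs(np.random.randn(250))  # worst case: no negatives, full scan
arr_early = arr.copy()
arr_early[0] = -1.0                 # best case: exits on the first element

start = time.perf_counter()
any_below_zero(arr, 250)            # includes compilation when numba is present
first_call = time.perf_counter() - start

start = time.perf_counter()
any_below_zero(arr, 250)            # reuses the compiled function
later_call = time.perf_counter() - start

print(f"first call: {first_call:.6f} s, later call: {later_call:.6f} s")
```

%timeit repeats the call many times, so the one-off compilation cost barely affects the figures it reports.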

With 100,000 dataframes (numdfs = 10**5, ten times the count in your test), jpp_nb still runs in under a second:

numdfs = 10**5     # create 100,000 dataframes
alldf = {i: pd.DataFrame({col: np.random.randn(250) for col in 'abcd'})
         for i in range(numdfs)}

%timeit jpp_nb(alldf)    # 746 ms
