Pandas: optimizing a multi-condition reverse nested loop over a DataFrame

Posted 2024-04-27 03:55:50


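I wrote the code below and it works well. However, it is too slow, because I have to run it many times. I would like to optimize it with vectorized operations, but I am struggling to find a way to do so, since I am not yet an expert in pandas.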

import pandas as pd


def slowFunctionToOptimize():
    # Variables definition
    minVolume = 2000
    exchange1 = 'binance'
    exchange2 = 'bitmart'
    volEx1Str = 'volume_' + exchange1
    volEx2Str = 'volume_' + exchange2
    threshold = 15.0
    minDuration = 10.0

    # See below for an example dataset
    dataset = pd.read_csv('example.csv', sep='|')
    
    indicesLst = dataset.index.values
    minIndexLst = indicesLst[0]
    
    # Get all indices that exceed or are equal to the specified threshold,
    # and normalize with the first index value to work with "iloc" later on
    indicesThresh = dataset.index[dataset.diffprice >= threshold].values - minIndexLst
    
    pv = None
    prevEndIndex = len(dataset)
    
    # Get the largest possible amount of rows (sequential order) based on the volume mean
    #  of the two exchanges, where the first value exceed or is equal to the threshold
    for startInd in indicesThresh:
        for endInd in range(prevEndIndex, 0, -1):
            if endInd - startInd < minDuration:
                break
    
            dfTmp = dataset.iloc[startInd:endInd, :]
            avgVolume1 = dfTmp[volEx1Str].mean()
            avgVolume2 = dfTmp[volEx2Str].mean()
            if avgVolume1 > minVolume and avgVolume2 > minVolume:
                # Get the final result.
                pv = dfTmp.copy()
                break
    
        # Largest amount of rows found, exiting
        if pv is not None:
            break
    
        prevEndIndex = startInd
    
    if pv is None:
        print('No combination could be found for this iteration.')
        return
    
    return pv

Here is the "example.csv" dataset:

^{tb1}$

Here is the expected output (the variable "pv" returned by the function):

^{tb2}$

1 answer
User
#1 · Posted 2024-04-27 03:55:50

To optimize this code, there are a few things you can do:

  1. Compute the averages more efficiently by adding a "cumulative volume" column for each of bitmart and binance
^{tb1}$

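The average volume over dataset.iloc[startInd:endInd] is then (dataset['cumulative volume'][endInd - 1] - dataset['cumulative volume'][startInd - 1]) / (endInd - startInd), with the subtracted term taken as 0 when startInd is 0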

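  2. Only update the data you need: copying the DataFrame is inefficient, so avoid rebuilding dfTmp on every iteration. Just keep track of startInd and endInd and use the previous trick to compute the average volume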
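
The two suggestions above can be combined roughly as follows. This is a minimal sketch, not a drop-in replacement: the function name `find_window` and its parameters are hypothetical, the column names `diffprice`, `volume_binance` and `volume_bitmart` are taken from the question, and it assumes the volume columns contain no missing values (`.mean()` skips NaN, a cumulative sum does not). It keeps the same loop order as the original function, but each window average costs two array lookups instead of a slice and a mean, and the DataFrame is copied only once, for the final result.

```python
import numpy as np
import pandas as pd


def find_window(dataset, vol_cols, threshold=15.0, min_volume=2000, min_duration=10):
    """Hypothetical rewrite of slowFunctionToOptimize using prefix sums."""
    n = len(dataset)

    # prefix[k] is the sum of the first k rows, so prefix[0] == 0 and
    # sum(rows[start:end]) == prefix[end] - prefix[start].
    prefixes = [
        np.concatenate(([0.0], dataset[col].to_numpy(dtype=float).cumsum()))
        for col in vol_cols
    ]

    # Positional indices of the rows at or above the threshold.
    starts = np.flatnonzero(dataset["diffprice"].to_numpy() >= threshold)

    prev_end = n
    for start in starts:
        for end in range(prev_end, 0, -1):
            length = end - start
            if length < min_duration:
                break
            # Window mean for each exchange, without slicing the DataFrame.
            if all((p[end] - p[start]) / length > min_volume for p in prefixes):
                # Copy only once, for the window that is actually returned.
                return dataset.iloc[start:end].copy()
        prev_end = start

    return None
```

Called as `find_window(dataset, ['volume_binance', 'volume_bitmart'])`, it returns the matching rows or `None` when no window qualifies. Because the averages come from differences of running totals, they can differ from `.mean()` by floating-point rounding on values very close to `min_volume`.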

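There are probably other tricks you could use, but without knowing the exact kind of data you are working with, I don't think I can help much further.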
