Python向dataframe添加多列,其中行和列是相互依赖的

2024-06-16 10:50:03 发布

您现在位置:Python中文网/ 问答频道 /正文

我已经设法用for循环实现了这一点,但是在我正在使用的大型数据集上,它们的速度非常慢,所以我尝试用pandas、groupby、apply和lamda函数来实现这一点。你知道吗

import pandas as pd
example_df = pd.DataFrame({"scen": [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2], 
                      "cusip": ['031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2', '031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2'], 
                      "wal": [50, 55, 60, 65, 40, 50, 60, 70, 40, 45, 50, 55, 30, 40, 50, 60],
                      "par_val": [900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000],
                      "prin_cf": [0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000],
                      "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0, 208.33, 0, 0, 0, 277.78, 0, 0, 0],
                      "book_val": [1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0]})

for x in range(1, len(example_df['scen'])):

if (example_df['cusip'][x] == example_df['cusip'][x-1]):

# If bond matures, don't report book value
    if(example_df['par_val'][x] == 0):
        example_df['book_val'][x] = 0
    else:
        example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]


    example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12

example_df

棘手的是,每一行的账面价值取决于前一行的摊销价值,而每一摊销价值取决于同一行的账面价值。看看这里对一个类似问题的回答,我认为有一种方法可以做到这一点,即使用全局变量来跟踪以前的值。你知道吗

Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?

比如:

def calc_bv(prin_cf, par_val, wal):
global bvalue, amort 
bvalue = bvalue - amort - prin_cf
amort = (bvalue - par_val)/wal/12
return bvalue, amort

bvalue = example_df.loc[0, 'book_val']
amort = example_df.loc[0, 'amortization']
example_df[1:][['book_val','amortization']] = example_df2[1:].apply(lambda row: calc_bv(row['prev_prin_cf'],row['par_val'],row['wal']), axis=1, result_type="expand")
example_df

Tags: indfexamplevalcfrowapply价值
1条回答
网友
1楼 · 发布于 2024-06-16 10:50:03

毫无疑问,会有一个基于groupby的智能熊猫解决方案。但是,只要用numba重写循环,就可以获得大约1000倍的性能改进。你知道吗

# Python 3.6.0, Pandas 0.19.2

assert jpp(df).equals(original(df))

%timeit jpp(df)       # 929 µs per loop
%timeit original(df)  # 1.05 s per loop

基准代码

原件:

def original(example_df):
    for x in range(1, len(example_df['scen'])):

        if (example_df['cusip'][x] == example_df['cusip'][x-1]):

        # If bond matures, don't report book value
            if(example_df['par_val'][x] == 0):
                example_df['book_val'][x] = 0
            else:
                example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]


            example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12
    return example_df

麻木:

from numba import njit

@njit
def calculator(cusip, par, book, amort, prin_cf, wal):
    n = len(par)
    for i in range(1, n):
        if cusip[i] == cusip[i-1]:
            if par[i] == 0:
                book[i] == 0
            else:
                book[i] = book[i-1] - amort[i-1] - prin_cf[i-1]
            amort[i] = (book[i] - par[i]) / wal[i] / 12
    return book, amort


def jpp(df):
    df['book_val'], df['amortization'] = calculator(pd.factorize(df['cusip'])[0], df['par_val'].values,
                                                    df['book_val'].values, df['amortization'].values,
                                                    df['prin_cf'].values, df['wal'].values)

    return df

相关问题 更多 >