Is loc faster than looping backwards to find the first match in a DataFrame?

Posted 2024-06-08 22:45:48


Suppose I have a large DataFrame with columns like these:

Date | Person 1 | Person 2 | Value 1 |  Value 2
+----------------------------------------------+

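Assume the DataFrame is sorted from oldest to newest.

Now I want to iterate over this DataFrame.

  • For each row, I first look at Person 1. Call this Person_1_id.

  • For Person 1, I want to find the most recent previous row, get its Value 1, and perform a complex calculation.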

My current way of getting the most recent Value 1 (v1) is:

value1s = df.loc[(df.ID1 == Person_1_id) & (df.Date < date)]
v1 = value1s.iloc[-1]

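As far as I understand, loc scans the whole DataFrame and returns every previous row that satisfies the condition.

Wouldn't it be faster to simply loop backwards through the DataFrame and pick the first row that satisfies the condition?

If so, how do I iterate backwards over a DataFrame?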
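To make the question concrete, here is a minimal sketch of what a backwards scan could look like. The sample data, the `Value1` column name, and the helper `last_match_backwards` are hypothetical; `ID1` and `Date` follow the snippet above. Note that a Python-level loop does more work per row than a vectorized boolean mask, so it may only pay off when the match sits close to the end.

```python
import pandas as pd

# Hypothetical sample data, sorted from oldest to newest.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-13", "2019-08-16", "2019-08-19",
                            "2019-08-22", "2019-08-25"]),
    "ID1": [71, 19, 30, 42, 19],
    "Value1": [1000, 1100, 1200, 1300, 1400],
})

def last_match_backwards(frame, person_id, date):
    """Scan rows from newest to oldest and return the first match, or None."""
    # iloc[::-1] reverses the row order; itertuples yields one namedtuple per row.
    for row in frame.iloc[::-1].itertuples(index=False):
        if row.ID1 == person_id and row.Date < date:
            return row  # stop at the first (most recent) match
    return None

match = last_match_backwards(df, 19, pd.Timestamp("2019-08-25"))
```

Here `match.Value1` should be the same value that the `loc` plus `iloc[-1]` approach returns for the same person and date.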

Edit: an example.

My initial table:

DATE        Person 1    Person 2    value 1 value 2
13/08/2019  71          19          1000    1000
16/08/2019  19          68          1000    1000
19/08/2019  30          98          1000    1000
22/08/2019  42          32          1000    1000
25/08/2019  19          78          1000    1000

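Algorithm:

  • Iterate over each row. Call the current row "current_row". All calculations update value 1 of this current row.
  • Get the id of Person 1 (the number in the "Person 1" column). Let's take person 19 as an example.
  • Find the most recent previous row in which person 19 appears in either the "Person 1" or the "Person 2" column. Call it "most_recent_prev_row".

Perform the following calculation: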

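# flag is 1 if the person appears in the 'Person 1' column of the previous row, else 0
flag = 1 if most_recent_prev_row['Person 1'] == person else 0
new_value = most_recent_prev_row['value 1'] + flag * 0.5 * most_recent_prev_row['value 2']
current_row['value 1'] = new_value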
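The steps and calculation above could also be sketched as a single forward pass that remembers each person's latest row in a dictionary, so no per-row lookup over the whole DataFrame is needed. This is a different technique from the loc-based lookup, not a confirmed solution. It assumes the rows are already sorted by date, that a person appears at most once per date, and that later rows should see the updated value 1; the function `update_values` and the dictionary `last_seen` are hypothetical names.

```python
import pandas as pd

# Hypothetical copy of the example table; dates are parsed day-first.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["13/08/2019", "16/08/2019", "19/08/2019",
                            "22/08/2019", "25/08/2019"], dayfirst=True),
    "Person 1": [71, 19, 30, 42, 19],
    "Person 2": [19, 68, 98, 32, 78],
    "value 1": [1000.0] * 5,
    "value 2": [1000.0] * 5,
})

def update_values(frame):
    """Single forward pass; assumes frame is sorted from oldest to newest."""
    last_seen = {}   # person id -> (value 1, value 2, flag) of their latest row
    new_values = []
    for p1, p2, v1, v2 in zip(frame["Person 1"], frame["Person 2"],
                              frame["value 1"], frame["value 2"]):
        if p1 in last_seen:
            prev_v1, prev_v2, flag = last_seen[p1]
            v1 = prev_v1 + flag * 0.5 * prev_v2
        new_values.append(v1)
        # Record this row (with the updated value 1) for both participants:
        # flag 1 for the person in 'Person 1', flag 0 for the one in 'Person 2'.
        last_seen[p1] = (v1, v2, 1)
        last_seen[p2] = (v1, v2, 0)
    return frame.assign(**{"value 1": new_values})

result = update_values(df)
```

Each row costs one dictionary lookup and two dictionary writes, so the total work grows linearly with the number of rows instead of rescanning all earlier rows for every row.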

For example, updating the second row for person 19 in the table above:

DATE        Person 1    Person 2    value 1               value 2
13/08/2019  19          71          1000                  1000
16/08/2019  19          68          1000+0.5*1000=1500    1000

If the first row were instead:

DATE        Person 1    Person 2    value 1               value 2
13/08/2019  71          19          1000                  1000
16/08/2019  19          68          1000-0.5*1000=500     1000

Finally, my calculation code is below. It is applied row by row and is very slow:

import pandas as pd

# helper function to calculate the new value
def calculate(value1, value2, flag):
    new_value = value1 + flag * 0.5 * value2
    return new_value

# function to update value (reads the global DataFrame df, sorted by DATE)
def updateValue(playerId, date):
    # default value if player has no wins or losses
    score = 1000

    # get wins and losses for the player. Players in 'Person 1' won, players in 'Person 2' lost.
    wins = df.loc[(df['Person 1'] == playerId) & (df.DATE < date)]
    losses = df.loc[(df['Person 2'] == playerId) & (df.DATE < date)]

    # player only has wins
    if not wins.empty and losses.empty:
        result_row = wins.iloc[-1]
        score = calculate(result_row['value 1'], result_row['value 2'], 1)

    # player only has losses
    elif wins.empty and not losses.empty:
        result_row = losses.iloc[-1]
        score = calculate(result_row['value 1'], result_row['value 2'], 0)

    # player has wins and losses: use whichever row is more recent
    elif not wins.empty and not losses.empty:
        p1_win_row = wins.iloc[-1]
        p1_lost_row = losses.iloc[-1]

        if p1_win_row.DATE < p1_lost_row.DATE:
            score = calculate(p1_lost_row['value 1'], p1_lost_row['value 2'], 0)
        else:
            score = calculate(p1_win_row['value 1'], p1_win_row['value 2'], 1)

    return score
