How can I efficiently add new columns using groupby in Pandas?

Posted 2024-06-16 13:01:31


I am using nba_py to fetch scoreboard data for some NBA games.

Here is a sample of the data structure:

   SEASON  GAME_DATE_EST        GAME_SEQUENCE  GAME_ID   HOME_TEAM_ID  VISITOR_TEAM_ID  WINNER
0  2013    2013-10-05T00:00:00  1              11300001  12321         1610612760       V
1  2013    2013-10-05T00:00:00  2              11300002  1610612754    1610612741       V
2  2013    2013-10-05T00:00:00  3              11300003  1610612745    1610612740       V
3  2013    2013-10-05T00:00:00  4              11300004  1610612747    1610612744       H
4  2013    2013-10-06T00:00:00  1              11300005  12324         1610612755       V

You can find part of the data here: NBA Games Data

My goal is to create the following columns and add them to the original data:

For the home team:

   1. Total wins/losses for hometeam if hometeam plays at home ("HOMETEAM_HOME_WINS"/"HOMETEAM_HOME_LOSSES")
   2. Total wins/losses for hometeam if hometeam is visiting ("HOMETEAM_VISITOR_WINS"/"HOMETEAM_VISITOR_LOSSES")

For the visiting team:

   3. Total wins/losses for visitor_team if visitor_team plays at home ("VISITOR_TEAM_HOME_WINS"/"VISITOR_TEAM_HOME_LOSSES")
   4. Total wins/losses for visitor_team if visitor_team is visiting ("VISITOR_TEAM_VISITOR_WINS"/"VISITOR_TEAM_VISITOR_LOSSES")

My first, naive approach is as follows:

def get_home_team_home_wins(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name
    season_hometeam_games = grouped_seasons_hometeams.get_group((season, hometeam))
    # Only games that appear before the current row count toward the total
    home_games = season_hometeam_games[(season_hometeam_games.index < gid)]

    if not home_games.empty:
        try:
            # WINNER is the result column shown in the sample above
            home_wins = home_games.WINNER.value_counts()["H"]
        except KeyError:
            home_wins = 0
    else:
        home_wins = 0

    return home_wins

grouped_seasons_hometeams = df.groupby(["SEASON", "HOME_TEAM_ID"])

df["HOMETEAM_HOME_WINS"] = df.apply(get_home_team_home_wins, axis=1)

Another approach is iterating over the rows:

grouped_seasons = df.groupby("SEASON")
df["HOMETEAM_HOME_WINS"] = 0

current_season = 0
for index, row in df.iterrows():
    season = row.SEASON
    if season != current_season:
        current_season = season
        season_games = grouped_seasons.get_group(current_season)

    hometeam = row.HOME_TEAM_ID
    gid = row.name
    games = season_games[(season_games.index < gid)]
    home_games = games[(games.HOME_TEAM_ID == hometeam)]

    if not home_games.empty:
        try:
            # WINNER is the result column shown in the sample above
            home_wins = home_games.WINNER.value_counts()["H"]
        except KeyError:
            home_wins = 0
    else:
        home_wins = 0

    # Write the value straight into the frame; df.ix has been removed from pandas
    df.at[index, "HOMETEAM_HOME_WINS"] = home_wins

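Update 1:

Below are some functions that get the home team's wins and losses, both when it plays at home and when it is visiting. A similar pair would be written for the visiting team.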

def get_home_team_home_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name

    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_home_games = games[(games.HOME_TEAM_ID == hometeam)]  


    # HOMETEAM plays at home
    if not home_team_home_games.empty:
        home_team_home_games_value_counts = home_team_home_games.WINNER.value_counts()

        try:
            home_team_home_wins = home_team_home_games_value_counts["H"]
        except Exception as e:
            home_team_home_wins = 0

        try:
            home_team_home_losses = home_team_home_games_value_counts["V"]
        except Exception as e:
            home_team_home_losses = 0
    else:
        home_team_home_wins = 0
        home_team_home_losses = 0

    return [home_team_home_wins, home_team_home_losses]

def get_home_team_visitor_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name

    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_visitor_games = games[(games.VISITOR_TEAM_ID == hometeam)]

    # HOMETEAM visits
    if not home_team_visitor_games.empty:
        home_team_visitor_games_value_counts = home_team_visitor_games.WINNER.value_counts()

        try:
            home_team_visitor_wins = home_team_visitor_games_value_counts["V"]
        except Exception as e:
            home_team_visitor_wins = 0

        try:
            home_team_visitor_losses = home_team_visitor_games_value_counts["H"]
        except Exception as e:
            home_team_visitor_losses = 0
    else:
        home_team_visitor_wins = 0
        home_team_visitor_losses = 0    

    return [home_team_visitor_wins, home_team_visitor_losses]

df["HOME_TEAM_HOME_WINS"], df["HOME_TEAM_HOME_LOSSES"] = zip(*df.apply(lambda x: get_home_team_home_wins_losses(x), axis=1))
df["HOME_TEAM_VISITOR_WINS"], df["HOME_TEAM_VISITOR_LOSSES"] = zip(*df.apply(lambda x: get_home_team_visitor_wins_losses(x), axis=1))
df["HOME_TEAM_WINS"] = df["HOME_TEAM_HOME_WINS"] + df["HOME_TEAM_VISITOR_WINS"]
df["HOME_TEAM_LOSSES"] = df["HOME_TEAM_HOME_LOSSES"] + df["HOME_TEAM_VISITOR_LOSSES"]

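The approaches above are not efficient. So I am considering groupby, but it is not clear to me how to use it here.

I will add updates whenever I find something more efficient.

Any ideas? Thanks.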


1 answer
Answer #1 · posted 2024-06-16 13:01:31

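Consider using transform(), but first conditionally create the integer columns HOMEWINNER and VISITWINNER. The commented-out numpy.where() lines are an easier-to-read equivalent of the if/else computation, but you may or may not have numpy available as a package.

Note that transform() keeps all rows but aggregates by ID, so every record for a given HOME_TEAM_ID repeats the same value in these aggregate columns.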
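To see that row-preserving behaviour in isolation, here is a minimal sketch on made-up toy data (the values are not from the NBA dataset). It only contrasts a plain grouped sum with transform('sum') on the same grouping.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "HOME_TEAM_ID": [1, 1, 2, 1],
    "SEASON": [2013, 2013, 2013, 2014],
    "HOMEWINNER": [1, 1, 0, 1],
})

grouped = toy.groupby(["HOME_TEAM_ID", "SEASON"])["HOMEWINNER"]

# A plain aggregation collapses the data to one row per (team, season) group.
per_group = grouped.sum()

# transform() returns one value per original row, aligned on the original index,
# so the result can be assigned directly as a new column.
toy["HOME_TEAM_WINS"] = grouped.transform("sum")

print(per_group)
print(toy)
```

Both rows of team 1 in season 2013 end up with the same total, which is the repetition described above.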

nbadf['VISITWINNER'] = [1 if x == 'V' else 0 for x in nbadf['WINNER']]
#nbadf['VISITWINNER'] = np.where(nbadf['WINNER']=='V', 1, 0)

nbadf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in nbadf['WINNER']]
#nbadf['HOMEWINNER'] = np.where(nbadf['WINNER']=='H', 1, 0)

# Season totals per team, repeated on every row of that team and season
nbadf['HOME_TEAM_WINS'] = nbadf.groupby(['HOME_TEAM_ID','SEASON'])\
                                        ['HOMEWINNER'].transform('sum')
nbadf['HOME_TEAM_LOSSES'] = nbadf.groupby(['HOME_TEAM_ID','SEASON'])\
                                          ['VISITWINNER'].transform('sum')

nbadf['VISIT_TEAM_WINS'] = nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])\
                                         ['VISITWINNER'].transform('sum')
nbadf['VISIT_TEAM_LOSSES'] = nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])\
                                           ['HOMEWINNER'].transform('sum')

# The helper columns are no longer needed
nbadf.drop(['HOMEWINNER', 'VISITWINNER'], inplace=True, axis=1)

#   SEASON  ...  WINNER  HOME_TEAM_WINS  HOME_TEAM_LOSSES  VISIT_TEAM_WINS  VISIT_TEAM_LOSSES
#0    2013  ...      V               0                 1                1                  0
#1    2013  ...      V               0                 1                1                  0
#2    2013  ...      V               0                 1                1                  0
#3    2013  ...      H               1                 0                0                  1
#4    2013  ...      V               0                 1                1                  0
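The totals above cover the whole season, whereas the attempts in the question only count games before the current row (index < gid). If that running count is what is wanted, one possible sketch uses a grouped cumsum() and subtracts the current game. This is not part of the answer's method; the toy frame below is invented, and it assumes the rows are already in chronological order.

```python
import pandas as pd

# Toy rows, invented for illustration and assumed to be in chronological order.
games = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013],
    "HOME_TEAM_ID": [10, 20, 10, 10],
    "VISITOR_TEAM_ID": [20, 10, 30, 20],
    "WINNER": ["H", "V", "V", "H"],
})

home_won = (games["WINNER"] == "H").astype(int)
home_lost = (games["WINNER"] == "V").astype(int)
keys = [games["SEASON"], games["HOME_TEAM_ID"]]

# cumsum() includes the current game, so subtracting the current value
# leaves only the games that came earlier in the same (season, home team) group.
games["HOMETEAM_HOME_WINS"] = home_won.groupby(keys).cumsum() - home_won
games["HOMETEAM_HOME_LOSSES"] = home_lost.groupby(keys).cumsum() - home_lost

print(games)
```

Grouping by VISITOR_TEAM_ID instead of HOME_TEAM_ID would give the visiting team's running record in the same way.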

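Now, for the cases where a home team later plays as the visitor and vice versa, consider merging subsets of the data frame on the IDs (change the column numbers if needed). This captures home teams that are also visiting teams. Then run the aggregations above on mergedf, this time computing the same conditional HOMEWINNER from WINNER_x and VISITWINNER from WINNER_y:

# MERGES HOME SUBSET DF AND VISITOR SUBSET DF (columns selected by position)
mergedf = pd.merge(nbadf.iloc[:, [0,1,2,3,4,6]], nbadf.iloc[:, [0,1,2,3,5,6]],
                   left_on=['HOME_TEAM_ID'], right_on=['VISITOR_TEAM_ID'], how='inner')

# WINNER_x comes from the home subset, WINNER_y from the visitor subset
mergedf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in mergedf['WINNER_x']]
mergedf['VISITWINNER'] = [1 if x == 'V' else 0 for x in mergedf['WINNER_y']]

mergedf['HOMETEAM_AS_VISITOR_WINS'] = mergedf.groupby(['VISITOR_TEAM_ID','SEASON_y'])\
                                                      ['VISITWINNER'].transform('sum')

mergedf['VISITORTEAM_AS_HOME_WINS'] = mergedf.groupby(['HOME_TEAM_ID','SEASON_x'])\
                                                      ['HOMEWINNER'].transform('sum')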
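Because that merge joins on the team ID alone, each home game is paired with every away game of the same team, so mergedf has more rows than the original frame. If one row per game is preferred, a possible alternative (a sketch, not the answer's method) is to aggregate away wins per season and team first and then attach the totals with a left merge. The toy data and the choice to fill missing teams with 0 are assumptions made for illustration.

```python
import pandas as pd

# Toy data, invented for illustration; team 40 never plays as a visitor.
games = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013],
    "HOME_TEAM_ID": [10, 20, 10, 40],
    "VISITOR_TEAM_ID": [20, 10, 30, 10],
    "WINNER": ["H", "V", "V", "V"],
})

# Season totals of wins while visiting, one row per (season, team).
away_wins = (
    games.assign(VISITWINNER=(games["WINNER"] == "V").astype(int))
         .groupby(["SEASON", "VISITOR_TEAM_ID"])["VISITWINNER"]
         .sum()
         .rename("HOMETEAM_AS_VISITOR_WINS")
         .reset_index()
         # Rename the key so it lines up with the home team of each game.
         .rename(columns={"VISITOR_TEAM_ID": "HOME_TEAM_ID"})
)

# A left merge keeps exactly one row per game; teams without away games get NaN,
# which is replaced by 0 here (an assumption about the desired output).
result = games.merge(away_wins, on=["SEASON", "HOME_TEAM_ID"], how="left")
result["HOMETEAM_AS_VISITOR_WINS"] = (
    result["HOMETEAM_AS_VISITOR_WINS"].fillna(0).astype(int)
)

print(result)
```

Including SEASON in the merge keys also keeps totals from different seasons apart.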
