如何在多重索引数据框中针对特定组进行pandas列的减法?

0 投票
1 回答
56 浏览
提问于 2025-04-13 20:34

我想要从每个组的平均行中减去对应的子列。这意味着我需要用 Dribbling_Speed_Team_Blue 减去与 Best Player Statistics 对应的 Dribbling_Speed

所以,最后一行(Difference)将会有以下的值 -

enter image description here

这是我之前生成这个数据框的工作 -

我使用嵌套字典创建了一个多重索引的 pandas 数据框 -

import pandas as pd
nested_dict = {
    'Game':{
    'Basketball': {
        'Player Statistics': {
            'Dribbling_Speed_Team_Blue': {
                'Player_A': 1,
                'Player_B': 3
            },
              'Dribbling_Speed_Team_Red': {
                'Player_A': 2,
                'Player_B': 4
            }
        },
       'Best Player Statistics': {
            'Dribbling_Speed': {
                'Player': 20,
            }
        }

    },

        'Football': {
        'Best Player Statistics': {
            'Kicking_Power': {
                'Player_A': 12,
                'Player_B': 8
            }
        },
       'Player Statistics': {
            'Kicking_Power_Team_Blue': {
                'Player': 40,
            },
            'Kicking_Power_Team_Red': {
                'Player': 40,
            }
        }

    },
    }
}

然后我对它进行了以下操作 -

out = pd.json_normalize(nested_dict)
out.columns = out.columns.str.split('.', expand=True)

sum_data = out.groupby(level=[0, 1,2,3], axis = 1).sum()
count_data = out.groupby(level=[0, 1,2,3], axis = 1).count()
result_df = pd.concat([sum_data, count_data], axis=0, keys=['Sum', 'Count'])
result_df.index = result_df.index.droplevel(-1)
result_df.loc['avg'] = result_df.loc['Sum']/result_df.loc['Count']

这样做得到了以下结果 -

enter image description here

如果有什么不清楚的地方,请告诉我。我对这个过程还很陌生,所以不知道从哪里开始。

1 个回答

1

我会尝试这样做:

import pandas as pd

# I am creating classes, just to make it easier for you to add functions to identify the best players.
class Player:
    def __init__(self, team, game):
        self.data = {'team': team, 'game': game}
    
    def get_data(self):
        return self.data
    
class BasketballPlayer(Player):
    def __init__(self, team, dribbling_speed):
        super(BasketballPlayer, self).__init__(team, 'basketball')
        self.data['dribbling_speed'] = dribbling_speed
    
class FootballPlayer(Player):
    def __init__(self, team, kicking_power):
        super(FootballPlayer, self).__init__(team, 'football')
        self.data['kicking_power'] = kicking_power
        
        
# while generating the data, you could do the following, as example:
players = []
players.append(BasketballPlayer('TeamRed', 1).get_data())
players.append(BasketballPlayer('TeamRed', 3).get_data())
players.append(BasketballPlayer('TeamBlue', 2).get_data())
players.append(BasketballPlayer('TeamBlue', 4).get_data())
players.append(BasketballPlayer('BestPlayer', 20).get_data())
players.append(FootballPlayer('BestPlayer', 12).get_data())
players.append(FootballPlayer('BestPlayer', 8).get_data())
players.append(FootballPlayer('TeamRed', 40).get_data())
players.append(FootballPlayer('TeamBlue', 40).get_data())

# now create the dataframe
df = pd.DataFrame(players)


# get the difference
means = df.groupby(['game', 'team']).mean()
means.apply(lambda x: x - means.loc[(x.name[0], 'BestPlayer')], axis=1)

这样会得到:

结果

虽然这个结果和你想要的格式不完全一样,但在计算方面有几个好处。

因为我不知道字符串数据是怎么读取的,所以我假设你可以逐个读取文件中的玩家数据。为此,我准备了玩家类。一旦数据变成这种形式,计算差异就变得简单了,因为你可以直接使用 pd.apply 和一个 lambda 函数。

撰写回答