如何在多重索引数据框中针对特定组进行pandas列的减法?
我想要从每个组的平均行中减去对应的子列。这意味着我需要用 Dribbling_Speed_Team_Blue
减去与 Best Player Statistics
对应的 Dribbling_Speed
。
所以,最后一行(Difference
)将会有以下的值 -
这是我之前生成这个数据框的工作 -
我使用嵌套字典创建了一个多重索引的 pandas 数据框 -
import pandas as pd
nested_dict = {
'Game':{
'Basketball': {
'Player Statistics': {
'Dribbling_Speed_Team_Blue': {
'Player_A': 1,
'Player_B': 3
},
'Dribbling_Speed_Team_Red': {
'Player_A': 2,
'Player_B': 4
}
},
'Best Player Statistics': {
'Dribbling_Speed': {
'Player': 20,
}
}
},
'Football': {
'Best Player Statistics': {
'Kicking_Power': {
'Player_A': 12,
'Player_B': 8
}
},
'Player Statistics': {
'Kicking_Power_Team_Blue': {
'Player': 40,
},
'Kicking_Power_Team_Red': {
'Player': 40,
}
}
},
}
}
然后我对它进行了以下操作 -
out = pd.json_normalize(nested_dict)
out.columns = out.columns.str.split('.', expand=True)
sum_data = out.groupby(level=[0, 1,2,3], axis = 1).sum()
count_data = out.groupby(level=[0, 1,2,3], axis = 1).count()
result_df = pd.concat([sum_data, count_data], axis=0, keys=['Sum', 'Count'])
result_df.index = result_df.index.droplevel(-1)
result_df.loc['avg'] = result_df.loc['Sum']/result_df.loc['Count']
这样做得到了以下结果 -
如果有什么不清楚的地方,请告诉我。我对这个过程还很陌生,所以不知道从哪里开始。
1 个回答
1
我会尝试这样做:
import pandas as pd
# I am creating classes, just to make it easier for you to add functions to identify the best players.
class Player:
def __init__(self, team, game):
self.data = {'team': team, 'game': game}
def get_data(self):
return self.data
class BasketballPlayer(Player):
def __init__(self, team, dribbling_speed):
super(BasketballPlayer, self).__init__(team, 'basketball')
self.data['dribbling_speed'] = dribbling_speed
class FootballPlayer(Player):
def __init__(self, team, kicking_power):
super(FootballPlayer, self).__init__(team, 'football')
self.data['kicking_power'] = kicking_power
# while generating the data, you could do the following, as example:
players = []
players.append(BasketballPlayer('TeamRed', 1).get_data())
players.append(BasketballPlayer('TeamRed', 3).get_data())
players.append(BasketballPlayer('TeamBlue', 2).get_data())
players.append(BasketballPlayer('TeamBlue', 4).get_data())
players.append(BasketballPlayer('BestPlayer', 20).get_data())
players.append(FootballPlayer('BestPlayer', 12).get_data())
players.append(FootballPlayer('BestPlayer', 8).get_data())
players.append(FootballPlayer('TeamRed', 40).get_data())
players.append(FootballPlayer('TeamBlue', 40).get_data())
# now create the dataframe
df = pd.DataFrame(players)
# get the difference
means = df.groupby(['game', 'team']).mean()
means.apply(lambda x: x - means.loc[(x.name[0], 'BestPlayer')], axis=1)
这样会得到:
虽然这个结果和你想要的格式不完全一样,但在计算方面有几个好处。
因为我不知道字符串数据是怎么读取的,所以我假设你可以逐个读取文件中的玩家数据。为此,我准备了玩家类。一旦数据变成这种形式,计算差异就变得简单了,因为你可以直接使用 pd.apply
和一个 lambda
函数。