Converting data from columns to rows in Pandas based on a specific column and group
Here is the sample data:
import pandas as pd

data = {'Game': ['NFS', 'Forza', 'Wreckfest', 'Dirt Rally', 'Burnout', 'Project Cars', 'Grid 2', 'GTA', 'Saints Row', 'Persona 5', 'COD', 'Battlefield', 'Counter Strike'],
        'Game Tier': ['Tier 1', 'Tier 2', 'Tier 2', 'Tier 2', 'Tier 3', 'Tier 3', 'Tier 4', 'Tier 1', 'Tier 3', 'Tier 2', 'Tier 1', 'Tier 1', 'Tier 2'],
        'Genre': ['Racing', 'Racing', 'Racing', 'Racing', 'Racing', 'Racing', 'Racing', 'Open World', 'Open World', 'RPG', 'Shooter', 'Shooter', 'Shooter']}
df = pd.DataFrame(data, columns=['Game', 'Game Tier', 'Genre'])
df
The DataFrame generated from this data:
    Game            Game Tier  Genre
0   NFS             Tier 1     Racing
1   Forza           Tier 2     Racing
2   Wreckfest       Tier 2     Racing
3   Dirt Rally      Tier 2     Racing
4   Burnout         Tier 3     Racing
5   Project Cars    Tier 3     Racing
6   Grid 2          Tier 4     Racing
7   GTA             Tier 1     Open World
8   Saints Row      Tier 3     Open World
9   Persona 5       Tier 2     RPG
10  COD             Tier 1     Shooter
11  Battlefield     Tier 1     Shooter
12  Counter Strike  Tier 2     Shooter
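The initial data can have up to 20,000 records. Each genre has fewer than 100 games.
I want to transform this data so that, within a genre, each game in one tier is mapped to every game in the next tier.
Tiers can go up to Tier 5, and the final output should have game columns for all 5 tiers, even when some tiers have no data.
The final output should look like this: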
    Tier_1_Game     Tier_2_Game     Tier_3_Game     Tier_4_Game     Tier_5_Game     Genre
0   NFS             Forza           Burnout         Grid 2                          Racing
1   NFS             Wreckfest       Burnout         Grid 2                          Racing
2   NFS             Dirt Rally      Burnout         Grid 2                          Racing
3   NFS             Forza           Project Cars    Grid 2                          Racing
4   NFS             Wreckfest       Project Cars    Grid 2                          Racing
5   NFS             Dirt Rally      Project Cars    Grid 2                          Racing
6   GTA                             Saints Row                                      Open World
7                   Persona 5                                                       RPG
8   COD             Counter Strike                                                  Shooter
9   Battlefield     Counter Strike                                                  Shooter
How we arrive at this final data:
First, we take the Racing genre and pick its Tier 1 game, NFS, and map it as below.
    Tier_1_Game     Tier_2_Game     Tier_3_Game     Tier_4_Game     Tier_5_Game     Genre
0   NFS                                                                             Racing
Next, Tier 2 of the Racing genre has 3 games, so we map each of them to NFS as shown below.
    Tier_1_Game     Tier_2_Game     Tier_3_Game     Tier_4_Game     Tier_5_Game     Genre
0   NFS             Forza                                                           Racing
1   NFS             Wreckfest                                                       Racing
2   NFS             Dirt Rally                                                      Racing
Then Tier 3 has 2 games, so we create one copy of the existing rows for each of them.
    Tier_1_Game     Tier_2_Game     Tier_3_Game     Tier_4_Game     Tier_5_Game     Genre
0   NFS             Forza           Burnout                                         Racing
1   NFS             Wreckfest       Burnout                                         Racing
2   NFS             Dirt Rally      Burnout                                         Racing
3   NFS             Forza           Project Cars                                    Racing
4   NFS             Wreckfest       Project Cars                                    Racing
5   NFS             Dirt Rally      Project Cars                                    Racing
Now Tier 4 has only 1 game, Grid 2, so we add it to the existing records.
    Tier_1_Game     Tier_2_Game     Tier_3_Game     Tier_4_Game     Tier_5_Game     Genre
0   NFS             Forza           Burnout         Grid 2                          Racing
1   NFS             Wreckfest       Burnout         Grid 2                          Racing
2   NFS             Dirt Rally      Burnout         Grid 2                          Racing
3   NFS             Forza           Project Cars    Grid 2                          Racing
4   NFS             Wreckfest       Project Cars    Grid 2                          Racing
5   NFS             Dirt Rally      Project Cars    Grid 2                          Racing
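The tier-by-tier expansion above amounts to a Cartesian product of the per-tier game lists within one genre. A minimal sketch of that idea with itertools.product, assuming the Racing games are hard-coded by tier (racing_tiers and racing_rows are names made up for this illustration, not part of the data):

```python
from itertools import product

# Racing games from the sample data, one list per tier (illustrative,
# hard-coded; the sample data has no Tier 5 Racing game).
racing_tiers = [
    ["NFS"],                               # Tier 1
    ["Forza", "Wreckfest", "Dirt Rally"],  # Tier 2
    ["Burnout", "Project Cars"],           # Tier 3
    ["Grid 2"],                            # Tier 4
]

# product() varies its last argument fastest. Passing the tiers in reverse
# and flipping each tuple back makes the lowest tier vary fastest, which
# reproduces the row order used in the walkthrough.
racing_rows = [combo[::-1] for combo in product(*reversed(racing_tiers))]

for row in racing_rows:
    print(row)
```

Each printed tuple corresponds to one row of the Racing table, in Tier 1 to Tier 4 order.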
Suppose the initial data had one more game, 'The Crew', in Tier 4 of the Racing genre. Then the final output would look like this:
    Tier_1_Game     Tier_2_Game     Tier_3_Game     Tier_4_Game     Tier_5_Game     Genre
0   NFS             Forza           Burnout         Grid 2                          Racing
1   NFS             Wreckfest       Burnout         Grid 2                          Racing
2   NFS             Dirt Rally      Burnout         Grid 2                          Racing
3   NFS             Forza           Project Cars    Grid 2                          Racing
4   NFS             Wreckfest       Project Cars    Grid 2                          Racing
5   NFS             Dirt Rally      Project Cars    Grid 2                          Racing
6   NFS             Forza           Burnout         The Crew                        Racing
7   NFS             Wreckfest       Burnout         The Crew                        Racing
8   NFS             Dirt Rally      Burnout         The Crew                        Racing
9   NFS             Forza           Project Cars    The Crew                        Racing
10  NFS             Wreckfest       Project Cars    The Crew                        Racing
11  NFS             Dirt Rally      Project Cars    The Crew                        Racing
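The jump from 6 to 12 rows follows from the row count for a genre being the product of the number of games in each of its tiers (here 1 x 3 x 2 x 2). A small sketch of that arithmetic, assuming 'The Crew' has been added as described (racing, tier_sizes and expected_rows are illustrative names):

```python
import pandas as pd

# Racing games only, with the hypothetical extra Tier 4 game 'The Crew'.
racing = pd.DataFrame({
    "Game": ["NFS", "Forza", "Wreckfest", "Dirt Rally",
             "Burnout", "Project Cars", "Grid 2", "The Crew"],
    "Game Tier": ["Tier 1", "Tier 2", "Tier 2", "Tier 2",
                  "Tier 3", "Tier 3", "Tier 4", "Tier 4"],
})

# Number of games in each tier that actually occurs.
tier_sizes = racing.groupby("Game Tier")["Game"].size()

# The expanded table has one row per combination of games across tiers.
expected_rows = int(tier_sizes.prod())

print(tier_sizes.to_dict())
print(expected_rows)
```

This is worth checking before expanding real data: even with fewer than 100 games per genre, 19 games in each of 5 tiers would give 19**5 = 2,476,099 rows for that one genre.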
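Now that all tiers of the Racing genre have been added, we move on to the next genre and repeat the process, and so on.
Does anyone have an idea how to achieve this?
Environment: Python 3.6, Pandas 1.0.1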
1 Answer
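Code
import pandas as pd  # df is the DataFrame built in the question

cols = ['Tier 1', 'Tier 2', 'Tier 3', 'Tier 4', 'Tier 5']

def get_tier(df):
    # Map each tier present in this genre to its games
    dfs = {tier: d['Game'] for tier, d in df.groupby('Game Tier')}
    # Cartesian product of the tiers; reindex adds missing tiers as NaN columns
    return (pd.MultiIndex.from_product(dfs.values(), names=dfs.keys())
              .to_frame(index=False).reindex(cols, axis=1)
              .add_suffix(' Game')
              .assign(Genre=df['Genre'].iloc[0])
           )

# Apply per genre, keeping genres in their order of first appearance
out = (df.groupby('Genre', group_keys=False, sort=False)
         .apply(get_tier).reset_index(drop=True)
      )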
Output
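One way to check the result locally is a self-contained variant of the same idea. The sketch below is not the code above: it loops over the genres instead of calling groupby.apply (whose handling of the grouping column has, as far as I know, changed in newer pandas releases), it names the columns Tier_1_Game ... Tier_5_Game as in the question, and it orders rows so that the lowest tier varies fastest. TIERS, OUT_COLS and build_tier_table are hypothetical names chosen for this illustration.

```python
from itertools import product

import pandas as pd

TIERS = ["Tier 1", "Tier 2", "Tier 3", "Tier 4", "Tier 5"]
OUT_COLS = [t.replace(" ", "_") + "_Game" for t in TIERS]


def build_tier_table(df):
    """Cross the games of every tier within each genre (hypothetical helper)."""
    records = []
    for genre, group in df.groupby("Genre", sort=False):
        # Games per tier that actually occur in this genre.
        by_tier = {tier: g["Game"].tolist()
                   for tier, g in group.groupby("Game Tier")}
        present = [t for t in TIERS if t in by_tier]
        # Reversed tier order makes the lowest tier vary fastest.
        for combo in product(*(by_tier[t] for t in reversed(present))):
            row = dict(zip(reversed(present), combo))
            row["Genre"] = genre
            records.append(row)
    # Tiers missing from a genre (or from all genres) become NaN columns.
    out = pd.DataFrame(records).reindex(columns=TIERS + ["Genre"])
    out.columns = OUT_COLS + ["Genre"]
    return out


# Sample data from the question.
data = {
    "Game": ["NFS", "Forza", "Wreckfest", "Dirt Rally", "Burnout",
             "Project Cars", "Grid 2", "GTA", "Saints Row", "Persona 5",
             "COD", "Battlefield", "Counter Strike"],
    "Game Tier": ["Tier 1", "Tier 2", "Tier 2", "Tier 2", "Tier 3",
                  "Tier 3", "Tier 4", "Tier 1", "Tier 3", "Tier 2",
                  "Tier 1", "Tier 1", "Tier 2"],
    "Genre": ["Racing"] * 7 + ["Open World"] * 2 + ["RPG"] + ["Shooter"] * 3,
}
df = pd.DataFrame(data)

out = build_tier_table(df)
print(out)
```

Tiers without games show up as NaN; calling out.fillna("") afterwards should give blank cells if that is preferred.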