如何将数据框从长格式转换为宽格式并带有“分组”列?

0 投票
3 回答
77 浏览
提问于 2025-04-12 17:09

在将下面这个数据表从长格式转换为宽格式时,我想把一些列分成“组”,并给它们加上前缀或后缀。

  • 这些元素的组可以有不同的大小,也就是说,可能由一个、两个或更多的元素/行组成。为了让例子简单,我这里用的是两个一组。
import pandas as pd

df = pd.DataFrame(
    [
        {'group': 'group-009297534',  'single_id': 'single-011900051',  'country': 'ESP',  'name': '00000911'},
        {'group': 'group-009297534',  'single_id': 'single-000000821',  'country': 'USA',  'name': '00001054'},
        {'group': 'group-009280053',  'single_id': 'single-000000002',  'country': 'HUN',  'name': '00000496'},
        {'group': 'group-009280053',  'single_id': 'single-000000014',  'country': 'HUN',  'name': '00000795'},
        {'group': 'group-009245039',  'single_id': 'single-000001258',  'country': 'NOR',  'name': '00000527'},
        {'group': 'group-009245039',  'single_id': 'single-000000669',  'country': 'TWN',  'name': '00000535'}
    ]
)

我给要分组的元素分配一个索引,然后用这个索引来命名列,这个方法已经朝着正确的方向走了,但结果还是和我预期的有些偏差。

df['idx'] = df.groupby('group').cumcount()
df.pivot(index='group', columns='idx')
group ('single_id', 0) ('single_id', 1) ('country', 0) ('country', 1) ('name', 0) ('name', 1)
group-009245039 single-000001258 single-000000669 NOR TWN 00000527 00000535
group-009280053 single-000000002 single-000000014 HUN HUN 00000496 00000795
group-009297534 single-011900051 single-000000821 ESP USA 00000911 00001054

不过,我预期的结果应该是这样的:

group single_id_1 country_1 name_1 single_id_2 country_2 name_2
0 group-009245039 single-000001258 NOR 00000527 single-000000669 TWN 00000535
1 group-009280053 single-000000002 HUN 00000496 single-000000014 HUN 00000795
2 group-009297534 single-011900051 ESP 00000911 single-000000821 USA 00001054

我不太确定用多重索引的方法是否合适,因为这可能需要进行排序和合并,或者是否有更简单的办法。

3 个回答

2

你可以使用来自janitor的 pivot_wider 函数。

import janitor
df['cols'] = df.groupby('group').cumcount() + 1
df.pivot_wider(index = 'group', names_from = 'cols')

            group      single_id_1      single_id_2 country_1 country_2 name_1 name_2
0 group-009245039 single-000001258 single-000000669       NOR       TWN    527    535
1 group-009280053 single-000000002 single-000000014       HUN       HUN    496    795
2 group-009297534 single-011900051 single-000000821       ESP       USA    911   1054

同样,你也可以使用来自siuba的 pivot_wider 函数:

from siuba import _
from siuba.experimental.pivot import pivot_wider

df['cols'] = df.groupby('group').cumcount() + 1
pivot_wider(df, names_from = _.cols, 
                values_from = _['single_id', 'country', 'name'],
                names_vary = 'slowest')

            group      single_id_1 country_1 name_1      single_id_2 country_2 name_2
0 group-009245039 single-000001258       NOR    527 single-000000669       TWN    535
1 group-009280053 single-000000002       HUN    496 single-000000014       HUN    795
2 group-009297534 single-011900051       ESP    911 single-000000821       USA   1054
2

这就是你想要的东西吗?

import pandas as pd

df = pd.DataFrame([
    {'group': 'group-009297534',  'single_id': 'single-011900051',  'country': 'ESP',  'name': '00000911'},
    {'group': 'group-009297534',  'single_id': 'single-000000821',  'country': 'USA',  'name': '00001054'},
    {'group': 'group-009280053',  'single_id': 'single-000000002',  'country': 'HUN',  'name': '00000496'},
    {'group': 'group-009280053',  'single_id': 'single-000000014',  'country': 'HUN',  'name': '00000795'},
    {'group': 'group-009245039',  'single_id': 'single-000001258',  'country': 'NOR',  'name': '00000527'},
    {'group': 'group-009245039',  'single_id': 'single-000000669',  'country': 'TWN',  'name': '00000535'}
])

df['idx'] = (df.groupby('group').cumcount() + 1).astype(str)


# Pivot the DataFrame with the new 'idx' for differentiation
df_pivoted = df.pivot(index='group', columns='idx')

# Flatten the MultiIndex and format column names
df_pivoted.columns = [f'{x[0]}_{x[1]}' for x in df_pivoted.columns]

# Reset the index to bring 'group' back as a column
df_pivoted.reset_index(inplace=True)

# Optional: Reorder the columns according to your expected output
# This assumes you know the order and number of groups
expected_order = [
    'group', 
    'single_id_1', 'country_1', 'name_1',
    'single_id_2', 'country_2', 'name_2'
]
df_pivoted = df_pivoted[expected_order]

print(df_pivoted)

输出结果:

  group       single_id_1 country_1    name_1       single_id_2  \
0  group-009245039  single-000001258       NOR  00000527  single-000000669   
1  group-009280053  single-000000002       HUN  00000496  single-000000014   
2  group-009297534  single-011900051       ESP  00000911  single-000000821   

  country_2    name_2  
0       TWN  00000535  
1       HUN  00000795  
2       USA  00001054 
2

代码

out = (df
       .assign(index=df.groupby('group').cumcount().add(1).astype('str'))
       .pivot(index='group', columns='index')
       .pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1))
       .sort_index(axis=1, 
                   key=lambda x: x.str.extract(r'.+_(\d+$)')[0].astype('int'))
       .reset_index()
)

输出

在这里输入图片描述

撰写回答