如何将数据框从长格式转换为宽格式并带有“分组”列?
在将下面这个数据表从长格式转换为宽格式时,我想把一些列分成“组”,并给它们加上前缀或后缀。
- 这些元素的组可以有不同的大小,也就是说,可能由一个、两个或更多的元素/行组成。为了让例子简单,我这里用的是两个一组。
import pandas as pd
df = pd.DataFrame(
[
{'group': 'group-009297534', 'single_id': 'single-011900051', 'country': 'ESP', 'name': '00000911'},
{'group': 'group-009297534', 'single_id': 'single-000000821', 'country': 'USA', 'name': '00001054'},
{'group': 'group-009280053', 'single_id': 'single-000000002', 'country': 'HUN', 'name': '00000496'},
{'group': 'group-009280053', 'single_id': 'single-000000014', 'country': 'HUN', 'name': '00000795'},
{'group': 'group-009245039', 'single_id': 'single-000001258', 'country': 'NOR', 'name': '00000527'},
{'group': 'group-009245039', 'single_id': 'single-000000669', 'country': 'TWN', 'name': '00000535'}
]
)
我给要分组的元素分配一个索引,然后用这个索引来命名列,这个方法已经朝着正确的方向走了,但结果还是和我预期的有些偏差。
df['idx'] = df.groupby('group').cumcount()
df.pivot(index='group', columns='idx')
group | ('single_id', 0) | ('single_id', 1) | ('country', 0) | ('country', 1) | ('name', 0) | ('name', 1) |
---|---|---|---|---|---|---|
group-009245039 | single-000001258 | single-000000669 | NOR | TWN | 00000527 | 00000535 |
group-009280053 | single-000000002 | single-000000014 | HUN | HUN | 00000496 | 00000795 |
group-009297534 | single-011900051 | single-000000821 | ESP | USA | 00000911 | 00001054 |
不过,我预期的结果应该是这样的:
group | single_id_1 | country_1 | name_1 | single_id_2 | country_2 | name_2 | |
---|---|---|---|---|---|---|---|
0 | group-009245039 | single-000001258 | NOR | 00000527 | single-000000669 | TWN | 00000535 |
1 | group-009280053 | single-000000002 | HUN | 00000496 | single-000000014 | HUN | 00000795 |
2 | group-009297534 | single-011900051 | ESP | 00000911 | single-000000821 | USA | 00001054 |
我不太确定用多重索引的方法是否合适,因为这可能需要进行排序和合并,或者是否有更简单的办法。
3 个回答
2
你可以使用来自janitor的 pivot_wider
函数。
import janitor
df['cols'] = df.groupby('group').cumcount() + 1
df.pivot_wider(index = 'group', names_from = 'cols')
group single_id_1 single_id_2 country_1 country_2 name_1 name_2
0 group-009245039 single-000001258 single-000000669 NOR TWN 527 535
1 group-009280053 single-000000002 single-000000014 HUN HUN 496 795
2 group-009297534 single-011900051 single-000000821 ESP USA 911 1054
同样,你也可以使用来自siuba的 pivot_wider
函数:
from siuba import _
from siuba.experimental.pivot import pivot_wider
df['cols'] = df.groupby('group').cumcount() + 1
pivot_wider(df, names_from = _.cols,
values_from = _['single_id', 'country', 'name'],
names_vary = 'slowest')
group single_id_1 country_1 name_1 single_id_2 country_2 name_2
0 group-009245039 single-000001258 NOR 527 single-000000669 TWN 535
1 group-009280053 single-000000002 HUN 496 single-000000014 HUN 795
2 group-009297534 single-011900051 ESP 911 single-000000821 USA 1054
2
这就是你想要的东西吗?
import pandas as pd
df = pd.DataFrame([
{'group': 'group-009297534', 'single_id': 'single-011900051', 'country': 'ESP', 'name': '00000911'},
{'group': 'group-009297534', 'single_id': 'single-000000821', 'country': 'USA', 'name': '00001054'},
{'group': 'group-009280053', 'single_id': 'single-000000002', 'country': 'HUN', 'name': '00000496'},
{'group': 'group-009280053', 'single_id': 'single-000000014', 'country': 'HUN', 'name': '00000795'},
{'group': 'group-009245039', 'single_id': 'single-000001258', 'country': 'NOR', 'name': '00000527'},
{'group': 'group-009245039', 'single_id': 'single-000000669', 'country': 'TWN', 'name': '00000535'}
])
df['idx'] = (df.groupby('group').cumcount() + 1).astype(str)
# Pivot the DataFrame with the new 'idx' for differentiation
df_pivoted = df.pivot(index='group', columns='idx')
# Flatten the MultiIndex and format column names
df_pivoted.columns = [f'{x[0]}_{x[1]}' for x in df_pivoted.columns]
# Reset the index to bring 'group' back as a column
df_pivoted.reset_index(inplace=True)
# Optional: Reorder the columns according to your expected output
# This assumes you know the order and number of groups
expected_order = [
'group',
'single_id_1', 'country_1', 'name_1',
'single_id_2', 'country_2', 'name_2'
]
df_pivoted = df_pivoted[expected_order]
print(df_pivoted)
输出结果:
group single_id_1 country_1 name_1 single_id_2 \
0 group-009245039 single-000001258 NOR 00000527 single-000000669
1 group-009280053 single-000000002 HUN 00000496 single-000000014
2 group-009297534 single-011900051 ESP 00000911 single-000000821
country_2 name_2
0 TWN 00000535
1 HUN 00000795
2 USA 00001054