Python将一列值拆分为多列并保持其他列不变
我有一些数据,如下所示。在这些数据中,我想把A列的值拆分成不同的列,并且要把对应的B列的值放在一起,同时C列要包含对应的组的值。请注意,这只是一个示例,实际上还有多个这样的组和它们对应的值。
输入数据
Column1 Column2 Column3
Group1 Value1 V1
Group1 Value2 V2
Group1 Value3 V3
Group1 Value4 V4
Group2 Value1 x1
Group2 Value2 x2
Group2 Value3 x3
Group2 Value4 x4
Group3 Value1 y1
Group3 Value2 y2
期望的输出结果:
Group1 Group2 Group3 Column3.Group1 Column3.Group2 Column3.Group3
Value1 Value1 Value1 v1 x1 y1
Value2 Value2 Value2 v2 x2 y1
Value3 Value3 NaN v3 x3 NaN
Value4 Value4 NaN v4 x4 NaN
有没有办法在Python中实现这个?我在论坛上搜索过,但没有找到……我对Python还很陌生,所以如果能有解释和解决方案就太好了。谢谢!
3 个回答
0
使用 GroupBy.cumcount
和 DataFrame.pivot
,最后在列表推导中扁平化 MultiIndex
:
out = (df.assign(g=df.groupby('Column1').cumcount())
.pivot(index='g', columns='Column1')
.rename_axis(None))
out.columns = [b if a == 'Column2' else f'{a}.{b}' for a, b in out.columns]
print (out)
Group1 Group2 Group3 Column3.Group1 Column3.Group2 Column3.Group3
0 Value1 Value1 Value1 V1 x1 y1
1 Value2 Value2 Value2 V2 x2 y2
2 Value3 Value3 NaN V3 x3 NaN
3 Value4 Value4 NaN V4 x4 NaN
如果原始的数据表有更多的列,但只需要处理 Column1, Column2, Column3
这几列,可以添加参数 value
:
out = (df.assign(g=df.groupby('Column1').cumcount())
.pivot(index='g', columns='Column1', values=['Column2','Column3'])
.rename_axis(None))
out.columns = [b if a == 'Column2' else f'{a}.{b}' for a, b in out.columns]
print (out)
Group1 Group2 Group3 Column3.Group1 Column3.Group2 Column3.Group3
0 Value1 Value1 Value1 V1 x1 y1
1 Value2 Value2 Value2 V2 x2 y2
2 Value3 Value3 NaN V3 x3 NaN
3 Value4 Value4 NaN V4 x4 NaN
0
这是一个经典的透视表操作,通常我们会去掉重复的列,但在这里我们是去掉重复的行。
为了做到这一点,我们需要先用 pivot
方法进行透视,前提是先用 cumcount
方法去掉重复的列:
out = (df
.assign(index=df.groupby('Column1').cumcount())
.pivot(index='index', columns='Column1')
.rename_axis(index=None)
)
out.columns = out.columns.map('.'.join)
输出结果:
Column2.Group1 Column2.Group2 Column2.Group3 Column3.Group1 Column3.Group2 Column3.Group3
0 Value1 Value1 Value1 V1 x1 y1
1 Value2 Value2 Value2 V2 x2 y2
2 Value3 Value3 NaN V3 x3 NaN
3 Value4 Value4 NaN V4 x4 NaN
对于你最开始问题中的原始格式:
out = (df
.assign(index=df.groupby('Column1').cumcount())
.pivot(index='index', columns='Column1').droplevel(0, axis=1)
.rename_axis(index=None, columns=None)
)
输出结果:
Group1 Group2 Group3 Group1 Group2 Group3
0 Value1 Value1 Value1 V1 x1 y1
1 Value2 Value2 Value2 V2 x2 y2
2 Value3 Value3 NaN V3 x3 NaN
3 Value4 Value4 NaN V4 x4 NaN
我个人建议保留多重索引,这样可以让层级关系更加清晰:
out = (df
.assign(index=df.groupby('Column1').cumcount())
.pivot(index='index', columns='Column1')
.rename_axis(index=None, columns=[None, None])
)
输出结果:
Column2 Column3
Group1 Group2 Group3 Group1 Group2 Group3
0 Value1 Value1 Value1 V1 x1 y1
1 Value2 Value2 Value2 V2 x2 y2
2 Value3 Value3 NaN V3 x3 NaN
3 Value4 Value4 NaN V4 x4 NaN
0
tmp = df.assign(cc=df.groupby('Column1').cumcount())
out = pd.concat(
[tmp.pivot(index='cc', columns='Column1', values='Column2'),
tmp.pivot(index='cc', columns='Column1', values='Column3').add_prefix('Column3.')
], axis=1).rename_axis(index=None, columns=None)
输出:
Group1 Group2 Group3 Column3.Group1 Column3.Group2 Column3.Group3
0 Value1 Value1 Value1 V1 x1 y1
1 Value2 Value2 Value2 V2 x2 y2
2 Value3 Value3 NaN V3 x3 NaN
3 Value4 Value4 NaN V4 x4 NaN