Expanding the unique values of a DataFrame column into multiple columns

2 votes

3 answers

84 views

Asked 2025-04-14 16:34

I need to reshape a DataFrame into the form shown below:

import pandas as pd
import numpy as np

# input DataFrame
df = pd.DataFrame({
   'foo': ['one', 'one', 'one', 'two', 'two', 'two', 'three', 'three', 'three'],
   'tak': ['dgad', 'dgad', 'dgad', 'ogfagas', 'ogfagas', 'ogfagas', 'adgadg', 'adgadg', 'adgadg'],
   'bar': ['B', 'B', 'A', 'C', 'A', 'C', 'C', 'C', 'C'],
   'nix': ['Z', 'Z', 'Z', 'G', 'G', 'G', 'Z', 'G', 'G']
})

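... The goal is for foo and tak to become the index (each foo has exactly one tak value, never several different ones). For bar and nix (I actually have 10 more columns that need the same treatment), I need to spread each column across several columns: bar_1 holds the first unique value of bar for each index entry, bar_2 the second unique value, and so on. If a foo group has only one unique value for bar or nix, or none at all, the remaining cells should be filled with np.nan. Like this: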

# desired output DataFrame
pd.DataFrame({
   'foo': ['one', 'two', 'three'],
   'tak': ['dgad', 'ogfagas', 'adgadg'],
   'bar_one': ['B', 'C', 'C'],
   'bar_two': ['A', 'A', np.nan],
   'nix_one': ['Z' , 'G', 'Z'],
   'nix_two': [np.nan, np.nan, 'G']
})

My current approach uses .pivot_table with this aggregation function:

pivot_df = df.pivot_table(
   index=['foo', 'tak'],
   values=['bar', 'nix'],
   aggfunc = lambda x: list(set(x))
)

Then I expand each foo-tak group's list of unique values into multiple columns and concatenate the results with a list comprehension:

pd.concat(
   [
       pivot_df[column].apply(pd.Series)
       for column in ['bar', 'nix']
   ],
   axis=1
)

Is there a simpler, more direct, more Pythonic way to do this transformation?


3 Answers

Instead of a pivot table, you can try DataFrame.groupby with a custom function applied to each group:

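def group_fn(g):
    # Map each column's unique values (in order of first appearance)
    # to numbered labels such as bar_1, bar_2, nix_1, ...
    out = {}
    for c in g.columns:
        for i, v in enumerate(g[c].unique(), 1):
            out[f"{c}_{i}"] = v
    # A dict keeps insertion order, so it converts directly to a Series
    return pd.Series(out)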


out = (
    df.groupby(["foo", "tak"], sort=False)
    .apply(group_fn, include_groups=False)
    .unstack(2)
    .reset_index()
)
print(out)

Output:

     foo      tak bar_1 bar_2 nix_1 nix_2
0    one     dgad     B     A     Z   NaN
1    two  ogfagas     C     A     G   NaN
2  three   adgadg     C   NaN     Z     G
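To see where the NaN cells come from, it can help to call the function on single groups. The snippet below is a minimal sketch that reuses the question's data; the variable names `one` and `three` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two', 'three', 'three', 'three'],
    'tak': ['dgad', 'dgad', 'dgad', 'ogfagas', 'ogfagas', 'ogfagas', 'adgadg', 'adgadg', 'adgadg'],
    'bar': ['B', 'B', 'A', 'C', 'A', 'C', 'C', 'C', 'C'],
    'nix': ['Z', 'Z', 'Z', 'G', 'G', 'G', 'Z', 'G', 'G'],
})

def group_fn(g):
    # Number each column's unique values in order of first appearance
    out = {}
    for c in g.columns:
        for i, v in enumerate(g[c].unique(), 1):
            out[f"{c}_{i}"] = v
    return pd.Series(out)

# Pass only the value columns, mimicking what apply hands over per group
one = group_fn(df.loc[df['foo'] == 'one', ['bar', 'nix']])
three = group_fn(df.loc[df['foo'] == 'three', ['bar', 'nix']])

print(one.to_dict())    # {'bar_1': 'B', 'bar_2': 'A', 'nix_1': 'Z'}
print(three.to_dict())  # {'bar_1': 'C', 'nix_1': 'Z', 'nix_2': 'G'}
```

The groups return Series with different labels, so the combined result is a long Series, and the unstack step fills in NaN wherever a group has no value for a label. Also note that include_groups is only accepted by recent pandas releases (2.2 and later, as far as I know); on older versions one would probably have to select the value columns before calling apply.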

Answered 2025-04-14 by Python大师


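You don't actually need a loop. First reshape the data to long form with melt, then remove repeated rows with drop_duplicates. Next, use groupby.cumcount to number the remaining unique values within each group, and finally use pivot to bring the data back to wide form: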

cols = ['foo', 'tak']

# melt, keep unique combinations, pivot back to wide form
out = (df
   .melt(cols).drop_duplicates()
   .assign(n=lambda d: d.groupby(cols+['variable']).cumcount().add(1))
   .pivot(index=cols, columns=['variable', 'n'], values='value')
)

# now we flatten the columns MultiIndex and reset the index
out.columns = out.columns.map(lambda x: f'{x[0]}_{x[1]}')
out.reset_index(inplace=True)

print(out)

Output:

     foo      tak bar_1 bar_2 nix_1 nix_2
0    one     dgad     B     A     Z   NaN
1  three   adgadg     C   NaN     Z     G
2    two  ogfagas     C     A     G   NaN

Intermediate result after melt, drop_duplicates, and the cumcount numbering:

# (df.melt(cols).drop_duplicates()
#    .assign(n=lambda d: d.groupby(cols+['variable']).cumcount().add(1))
# )

      foo      tak variable value  n
0     one     dgad      bar     B  1
2     one     dgad      bar     A  2
3     two  ogfagas      bar     C  1
4     two  ogfagas      bar     A  2
6   three   adgadg      bar     C  1
9     one     dgad      nix     Z  1
12    two  ogfagas      nix     G  1
15  three   adgadg      nix     Z  1
16  three   adgadg      nix     G  2
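Since the question mentions about ten more columns that need the same treatment, the steps above could be wrapped in a small helper. This is only a sketch: the function name `unique_values_wide` and its `value_cols` parameter are made up for illustration, and it assumes the frame has no columns already named `variable`, `value`, or `n`.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two', 'three', 'three', 'three'],
    'tak': ['dgad', 'dgad', 'dgad', 'ogfagas', 'ogfagas', 'ogfagas', 'adgadg', 'adgadg', 'adgadg'],
    'bar': ['B', 'B', 'A', 'C', 'A', 'C', 'C', 'C', 'C'],
    'nix': ['Z', 'Z', 'Z', 'G', 'G', 'G', 'Z', 'G', 'G'],
})

def unique_values_wide(df, index_cols, value_cols=None):
    """Hypothetical helper: one output column per unique value, numbered by first appearance."""
    long = (
        df.melt(id_vars=index_cols, value_vars=value_cols)  # None means all other columns
        .drop_duplicates()
        # number the surviving values within each (index, variable) group
        .assign(n=lambda d: d.groupby(index_cols + ['variable']).cumcount().add(1))
    )
    wide = long.pivot(index=index_cols, columns=['variable', 'n'], values='value')
    # flatten the (variable, n) MultiIndex into names like bar_1
    wide.columns = [f'{name}_{n}' for name, n in wide.columns]
    return wide.reset_index()

out = unique_values_wide(df, ['foo', 'tak'])
print(out)
```

Because melt picks up every column that is not listed in `index_cols`, adding more value columns to the input should not require any change to the call.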

Answered 2025-04-14 by Python大师


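Code

I think combining apply with axis=1 and pd.Series makes the code run very slowly.

The code below should be faster. It presumably expects df to already hold the list-valued columns, for example the question's pivot_df after reset_index():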

cols = ['bar', 'nix']

out = pd.concat(
    [df.drop(cols, axis=1)] + 
    [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in cols], 
    axis=1
)

Output

     foo      tak bar0  bar1 nix0  nix1
0    one     dgad    B     A    Z  None
1  three   adgadg    C  None    Z     G
2    two  ogfagas    C     A    G  None

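Check it with the following example

The question's DataFrame (df) is repeated 100,000 times to build a larger DataFrame.

df = pd.concat([df] * 100000).reset_index(drop=True)

Using apply with axis=1 plus pd.Series

takes 82.65982 seconds

The solution I provided

takes only 0.09070 seconds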
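For anyone who wants to repeat the comparison, here is a rough timing sketch. It is not the benchmark used above: the Series `lists` is a small made-up stand-in for a list-valued column, and the measured times will depend on the machine and the pandas version.

```python
import time
import pandas as pd

# Made-up stand-in for a column whose cells are lists of unique values
lists = pd.Series([['B', 'A'], ['C'], ['C', 'A']] * 1000)

start = time.perf_counter()
slow = lists.apply(pd.Series)  # builds one Series object per row
slow_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = pd.DataFrame(lists.tolist(), index=lists.index)  # builds the frame in one call
fast_seconds = time.perf_counter() - start

print(f'apply(pd.Series): {slow_seconds:.5f} s')
print(f'DataFrame(tolist): {fast_seconds:.5f} s')
```

Both variants give the same shape and the same values; they differ only in how the missing cells are represented (NaN versus None).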

Answered 2025-04-14 by Python大师

