Aggregate by group into a list and add it as a new column

2 votes
4 answers
52 views
Asked 2025-04-14 17:19

I want to aggregate all product names in a grouped DataFrame into a new column, as either a list or a set:

import pandas as pd  # 2.0.3

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1]
    }
)

print(df)
grouped = df.groupby(["customer_id", "order_id"])
df["product_order_count"] = grouped["amount"].transform("sum")
df["all_products"] = grouped["products"].agg(list).reset_index()
print(df)

I followed another question (Pandas groupby: How to get a union of strings), but I still get an exception:

Traceback (most recent call last):
  File "C:\temp\tt.py", line 15, in <module>
    df["all_orders"] = grouped["products"].agg(list).reset_index()
  File "c:\Users\foo\.venvs\kapa_monitor-38\lib\site-packages\pandas\core\frame.py", line 3940, in __setitem__
    self._set_item_frame_value(key, value)
  File "c:\Users\foo\.venvs\kapa_monitor-38\lib\site-packages\pandas\core\frame.py", line 4094, in _set_item_frame_value
    raise ValueError(
ValueError: Cannot set a DataFrame with multiple columns to the single column all_products

Expected output (all_products as a list or set):

   customer_id  order_id products  amount  product_order_count all_products
0            1         1      foo       1                    2 'foo', 'bar'
1            2         2      bar       1                    1 'bar'
2            3         3      baz       1                    1 'baz'
3            2         4      foo       1                    1 'foo'
4            1         1      bar       1                    2 'foo', 'bar'

4 Answers

2

grouped["products"].agg(list).reset_index() returns a DataFrame, so the ValueError is raised because you are trying to assign a DataFrame with multiple columns to a single column.

You can merge it back into the original df:

grouped = df.groupby(["customer_id", "order_id"])
df["product_order_count"] = grouped["amount"].transform("sum")
grouped = (
    grouped["products"]
    .agg(list)
    .reset_index()
    .rename(columns={"products": "all_products"})
)
df = pd.merge(df, grouped, on=["customer_id", "order_id"], how="left")
   customer_id  order_id products  amount  product_order_count all_products
0            1         1      foo       1                    2   [foo, bar]
1            2         2      bar       1                    1        [bar]
2            3         3      baz       1                    1        [baz]
3            2         4      foo       1                    1        [foo]
4            1         1      bar       1                    2   [foo, bar]
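
To see why the original assignment fails, it can help to inspect what the aggregation returns before merging. The following is a minimal sketch using the question's data; the names per_group and as_frame are made up for illustration. It also shows that the aggregated result has one row per group (4) rather than one per original row (5), which is why a merge is used to line it up with df.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1],
    }
)
grouped = df.groupby(["customer_id", "order_id"])

# One list per (customer_id, order_id) group, indexed by the group keys.
per_group = grouped["products"].agg(list)

# reset_index() turns the group keys into columns, giving a 3-column DataFrame.
as_frame = per_group.reset_index()

print(type(as_frame).__name__, list(as_frame.columns))
print(len(per_group), len(df))
```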
2

Code

# your code
grouped = df.groupby(["customer_id", "order_id"])
df["product_order_count"] = grouped["amount"].transform("sum")

# process
m = grouped['products'].agg(list).reset_index(name='all_products')
out = df.merge(m, how='left')

Output

[image: resulting DataFrame]
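
The merge in this answer does not pass on, so pandas joins on the columns the two frames have in common, which here are customer_id and order_id. A sketch of the same idea with the keys spelled out, plus validate="many_to_one" as an optional safety check (an addition not in the answer) that raises if the right-hand frame ever has duplicate keys:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1],
    }
)
keys = ["customer_id", "order_id"]
grouped = df.groupby(keys)

# One row per group, with the list stored in a column named all_products.
m = grouped["products"].agg(list).reset_index(name="all_products")

# Explicit join keys; many_to_one asserts that each key appears once in m.
out = df.merge(m, on=keys, how="left", validate="many_to_one")
print(out)
```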

2

You can use transform with a function that returns something of the same length as the group:

df["all_products"] = grouped["products"].transform(lambda x: [list(x)]*len(x))

Output:

   customer_id  order_id products  amount  product_order_count all_products
0            1         1      foo       1                    2   [foo, bar]
1            2         2      bar       1                    1        [bar]
2            3         3      baz       1                    1        [baz]
3            2         4      foo       1                    1        [foo]
4            1         1      bar       1                    2   [foo, bar]
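
One thing to be aware of with [list(x)]*len(x): in Python this repeats a reference to a single list, so the rows of a group may end up sharing one list object, and mutating it in place in one row would then show up in the others. If that matters, a variant that builds a separate list for each row is sketched below (this is an assumption about what you might want, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1],
    }
)
grouped = df.groupby(["customer_id", "order_id"])

# Build a fresh list for every row of the group instead of repeating one list.
df["all_products"] = grouped["products"].transform(lambda x: [list(x) for _ in x])

# Rows 0 and 4 belong to the same group but hold distinct list objects.
print(df.at[0, "all_products"] is df.at[4, "all_products"])
```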

Or you can join the strings instead (I don't really recommend storing lists in a DataFrame):

df["all_products"] = grouped["products"].transform(','.join)

This gives:

   customer_id  order_id products  amount  product_order_count all_products
0            1         1      foo       1                    2      foo,bar
1            2         2      bar       1                    1          bar
2            3         3      baz       1                    1          baz
3            2         4      foo       1                    1          foo
4            1         1      bar       1                    2      foo,bar
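
The question also mentions a set. If the goal is the distinct products per order, the string approach can be adapted by deduplicating before joining. A small sketch, assuming hypothetical data in which order 1 contains "foo" twice and a made-up column name unique_products; sorting is used only to make the result deterministic:

```python
import pandas as pd

# Hypothetical data: order 1 lists "foo" twice to show the deduplication.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 1],
        "order_id": [1, 2, 1, 1],
        "products": ["foo", "bar", "bar", "foo"],
    }
)
grouped = df.groupby(["customer_id", "order_id"])

# set() drops duplicates, sorted() fixes the order, and transform
# broadcasts the resulting string to every row of the group.
df["unique_products"] = grouped["products"].transform(
    lambda s: ",".join(sorted(set(s)))
)
print(df)
```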
