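Aggregate by group into a list as a new column
I want to aggregate all product names in a grouped DataFrame into a new column, as either a list or a set: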
import pandas as pd # 2.0.3
df = pd.DataFrame(
{
"customer_id": [1, 2, 3, 2, 1],
"order_id": [1, 2, 3, 4, 1],
"products": ["foo", "bar", "baz", "foo", "bar"],
"amount": [1, 1, 1, 1, 1]
}
)
print(df)
grouped = df.groupby(["customer_id", "order_id"])
df["product_order_count"] = grouped["amount"].transform("sum")
df["all_products"] = grouped["products"].agg(list).reset_index()
print(df)
I followed another question (Pandas groupby: How to get a union of strings), but my code raises an exception:
Traceback (most recent call last):
File "C:\temp\tt.py", line 15, in <module>
df["all_orders"] = grouped["products"].agg(list).reset_index()
File "c:\Users\foo\.venvs\kapa_monitor-38\lib\site-packages\pandas\core\frame.py", line 3940, in __setitem__
self._set_item_frame_value(key, value)
File "c:\Users\foo\.venvs\kapa_monitor-38\lib\site-packages\pandas\core\frame.py", line 4094, in _set_item_frame_value
raise ValueError(
ValueError: Cannot set a DataFrame with multiple columns to the single column all_products
Expected output (all_products as a list or a set):
customer_id order_id products amount product_order_count all_products
0 1 1 foo 1 2 'foo', 'bar'
1 2 2 bar 1 1 'bar'
2 3 3 baz 1 1 'baz'
3 2 4 foo 1 1 'foo'
4 1 1 bar 1 2 'foo', 'bar'
4 Answers
2
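grouped["products"].agg(list).reset_index() returns a DataFrame with three columns (customer_id, order_id, products), so the ValueError is raised because you are trying to assign a multi-column DataFrame to a single column.
You can merge it back into the original df instead: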
grouped = df.groupby(["customer_id", "order_id"])
df["product_order_count"] = grouped["amount"].transform("sum")
grouped = (
grouped["products"]
.agg(list)
.reset_index()
.rename(columns={"products": "all_products"})
)
df = pd.merge(df, grouped, on=["customer_id", "order_id"], how="left")
customer_id order_id products amount product_order_count all_products
0 1 1 foo 1 2 [foo, bar]
1 2 2 bar 1 1 [bar]
2 3 3 baz 1 1 [baz]
3 2 4 foo 1 1 [foo]
4 1 1 bar 1 2 [foo, bar]
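To make the cause of the error concrete, here is a minimal sketch that inspects the shape of the aggregated result and then attaches it with DataFrame.join instead of pd.merge. The names keys, per_group, as_frame and joined are only illustrative, and join is an alternative I am suggesting, not something from the answer above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1],
    }
)
keys = ["customer_id", "order_id"]  # illustrative helper name

# One entry per group, indexed by a (customer_id, order_id) MultiIndex.
per_group = df.groupby(keys)["products"].agg(list).rename("all_products")

# reset_index() turns both index levels into columns, so the result is a
# three-column DataFrame with one row per group, not one value per row of df.
as_frame = per_group.reset_index()
print(as_frame.shape)
print(list(as_frame.columns))

# Alternative to pd.merge: join the named Series on the key columns.
# The original row index of df is kept.
joined = df.join(per_group, on=keys)
print(joined)
```

Both approaches align the per-group lists with the original rows through the key columns; which one reads better is mostly a matter of taste.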
2
You can use transform with a function that returns an object of the same length as the group:
df["all_products"] = grouped["products"].transform(lambda x: [list(x)]*len(x))
Output:
customer_id order_id products amount product_order_count all_products
0 1 1 foo 1 2 [foo, bar]
1 2 2 bar 1 1 [bar]
2 3 3 baz 1 1 [baz]
3 2 4 foo 1 1 [foo]
4 1 1 bar 1 2 [foo, bar]
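Since the question also asks about sets, the same pattern should carry over by swapping list for set. This is a sketch under that assumption rather than something tested against every pandas version, and the printed element order inside each set is not guaranteed:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1],
    }
)
grouped = df.groupby(["customer_id", "order_id"])

# Build one set per group and repeat it once per row of that group,
# so the returned object has the same length as the group.
df["all_products"] = grouped["products"].transform(lambda x: [set(x)] * len(x))
print(df)
```

Note that `[obj] * len(x)` repeats a reference to the same object, so rows of one group may share a single set; copy it before mutating if that matters.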
Alternatively, you can join the strings (I wouldn't really recommend storing lists in a DataFrame):
df["all_products"] = grouped["products"].transform(','.join)
This gives:
customer_id order_id products amount product_order_count all_products
0 1 1 foo 1 2 foo,bar
1 2 2 bar 1 1 bar
2 3 3 baz 1 1 baz
3 2 4 foo 1 1 foo
4 1 1 bar 1 2 foo,bar
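If you go with joined strings but later need the individual items, a sketch like the following recovers them with Series.str.split. The column name all_products_list is made up for illustration, and this assumes no product name contains a comma:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 2, 1],
        "order_id": [1, 2, 3, 4, 1],
        "products": ["foo", "bar", "baz", "foo", "bar"],
        "amount": [1, 1, 1, 1, 1],
    }
)
grouped = df.groupby(["customer_id", "order_id"])

# Store one comma-separated string per row, as in the answer above.
df["all_products"] = grouped["products"].transform(",".join)

# Split back into Python lists only where they are needed
# (assumes the separator never appears inside a product name).
df["all_products_list"] = df["all_products"].str.split(",")
print(df)
```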