展开Polars DataFrame列而不重复其他列值

6 投票

2 回答

87 浏览

提问于 2025-04-14 16:46

作为一个简单的例子，假设我们有一个叫做 polars.DataFrame 的数据表，内容如下：

df = pl.DataFrame({"sub_id": [1,2,3], "engagement": ["one:one,two:two", "one:two,two:one", "one:one"], "total_duration": [123, 456, 789]})

sub_id	engagement	total_duration
1	one:one,two:two	123
2	one:two,two:one	456
3	one:one	789

接下来，我们要把“engagement”这一列进行拆分。

df = df.with_columns(pl.col("engagement").str.split(",")).explode("engagement")

这样处理后，我们得到的结果是：

sub_id	engagement	total_duration
1	one:one	123
1	two:two	123
2	one:two	456
2	two:one	456
3	one:one	789

为了可视化数据，我使用了 Plotly，代码如下：

import plotly.express as px
fig = px.bar(df, x="sub_id", y="total_duration", color="engagement")
fig.show()

生成的图表如下：

生成的图表

这基本上意味着订阅者 1 和 2 的总观看时间（total_duration）翻倍了。现在我想保持每个订阅者的总观看时间不变，但又想保留图表图例中显示的参与组，该怎么做呢？

data visualization dataframe plotly polars data transformation data manipulation engagement analysis total duration

2 个回答

假设你想把你的 total_duration 在 sub_id 的 engagement 之间平均分配，你可以在进行分割之前先调整一下它：

(
    df.with_columns(pl.col('engagement').str.split(','))
    .with_columns(
        pl.col('total_duration') / pl.col('engagement').list.len(),
    ).explode('engagement')
)

┌────────┬────────────┬────────────────┐
│ sub_id ┆ engagement ┆ total_duration │
│ ---    ┆ ---        ┆ ---            │
│ i64    ┆ str        ┆ f64            │
╞════════╪════════════╪════════════════╡
│ 1      ┆ one:one    ┆ 61.5           │
│ 1      ┆ two:two    ┆ 61.5           │
│ 2      ┆ one:two    ┆ 228.0          │
│ 2      ┆ two:one    ┆ 228.0          │
│ 3      ┆ one:one    ┆ 789.0          │
└────────┴────────────┴────────────────┘

回答于 2025-04-14 由 Python大师

分享举报

在polars中处理这个问题的一个方法是，将total_duration在同一个sub_id的参与行之间平均分配。为此，我们只需要将total_duration除以该sub_id的行数。

(
    df
    .with_columns(
        pl.col("engagement").str.split(",")
    )
    .explode("engagement")
    .with_columns(
        pl.col("total_duration") / pl.len().over("sub_id")
    )
)

shape: (5, 3)
┌────────┬────────────┬────────────────┐
│ sub_id ┆ engagement ┆ total_duration │
│ ---    ┆ ---        ┆ ---            │
│ i64    ┆ str        ┆ f64            │
╞════════╪════════════╪════════════════╡
│ 1      ┆ one:one    ┆ 61.5           │
│ 1      ┆ two:two    ┆ 61.5           │
│ 2      ┆ one:two    ┆ 228.0          │
│ 2      ┆ two:one    ┆ 228.0          │
│ 3      ┆ one:one    ┆ 789.0          │
└────────┴────────────┴────────────────┘

回答于 2025-04-14 由 Python大师

分享举报

展开Polars DataFrame列而不重复其他列值

2 个回答

撰写回答