Clip Polars values to the per-group maximum of another Polars DataFrame

6 votes

1 answer

90 views

Asked 2025-04-14 15:37

I have two DataFrames:

import polars as pl

df1 = pl.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "index": [1, 3, 5, 1, 3, 8],
    }
)

df2 = pl.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "index": [3, 4, 7, 2, 7, 10],
    }
)

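I want to cap the index column of the second DataFrame (df2) at the maximum index of each group in the first DataFrame (df1). Both DataFrames contain the same groups.

The expected result for df2 is: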

shape: (6, 2)
┌───────┬───────┐
│ group ┆ index │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ A     ┆ 3     │
│ A     ┆ 4     │
│ A     ┆ 5     │
│ B     ┆ 2     │
│ B     ┆ 7     │
│ B     ┆ 8     │
└───────┴───────┘


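1 Answer

If the groups are identical and appear in the same row order in both DataFrames, you can compute each group's maximum in df1 with a window expression and pass it to clip on df2: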

out = df2.with_columns(
    pl.col('index').clip(
        upper_bound=df1.select(pl.col('index').max().over('group'))['index']
    )
)

Output:

shape: (6, 2)
┌───────┬───────┐
│ group ┆ index │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ A     ┆ 3     │
│ A     ┆ 4     │
│ A     ┆ 5     │
│ B     ┆ 2     │
│ B     ┆ 7     │
│ B     ┆ 8     │
└───────┴───────┘

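Alternatively, if the groups in the two DataFrames do not line up exactly (for example, a different number of rows per group), you can compute the maxima with group_by.max and use a join to align them with df2. Note that the resulting column is still passed to clip positionally, so this relies on the join returning the rows in the same order as df2: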

df1 = pl.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "index": [1, 3, 5, 1, 3, 7],
    }
)

df2 = pl.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B", "B"],
        "index": [3, 4, 7, 2, 7, 8, 9],
    }
)

out = df2.with_columns(
    pl.col('index').clip(
        upper_bound=df2.join(df1.group_by('group').max(), on='group')['index_right']
    )
)

Output:

shape: (7, 2)
┌───────┬───────┐
│ group ┆ index │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ A     ┆ 3     │
│ A     ┆ 4     │
│ A     ┆ 5     │
│ B     ┆ 2     │
│ B     ┆ 7     │
│ B     ┆ 7     │
│ B     ┆ 7     │
└───────┴───────┘

Answered 2025-04-14 by Python大师
