Polars custom function returning multiple columns

2 votes

3 answers

68 views

Asked 2025-04-14 16:27

The function _func is meant to return two columns:

from polars.type_aliases import IntoExpr, IntoExprColumn
import numpy as np
import polars as pl

def _func(x: IntoExpr):
    x1 = x+1
    x2 = x+2
    return pl.struct([x1, x2])
    
df = pl.DataFrame({"test": np.arange(1, 11)})
# This is the attempt that does not work.
df.with_columns(
    _func(pl.col("test")).alias(["test1", "test2"])
)

I tried wrapping the returned values with pl.struct, but it did not work.

The output I expect is:

shape: (10, 3)
test test1 test2
i32 i32 i32
1   2   3
2   3   4
3   4   5
4   5   6
5   6   7
6   7   8
7   8   9
8   9   10
9   10  11
10  11  12

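Tags: data processing, custom functions, dataframe, polars, returning multiple columns

3 Answers

To take execution speed into account, I benchmarked the four approaches mentioned here, using slightly more complex computations. I also added a fifth approach that does everything inside with_columns() instead of calling unnest() outside of it: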

from polars.type_aliases import IntoExpr, IntoExprColumn
from typing import Sequence
import polars as pl
import numpy as np

def func1(x: IntoExpr):
    x1 = (x + 1) ** 2 + x.sin()
    x2 = x.exp() / x.log()
    return x1, x2

# aliases is indexed below, so it must be a Sequence rather than a bare Iterable
def func2(x: IntoExpr, aliases: Sequence[str]):
    x1 = (x + 1) ** 2 + x.sin()
    x2 = x.exp() / x.log()
    return pl.struct(x1.alias(aliases[0]), x2.alias(aliases[1]))

np.random.seed(42)
df = pl.DataFrame({"test": np.random.normal(0, 0.1, 100_000_000)})
df_lazy = df.lazy()

Eager execution:

Using enumerate():

%%timeit
df.with_columns(
    x.alias(f"test{i+1}") 
    for i, x in enumerate(func1(pl.col("test")))
    )

4.17 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using zip():

%%timeit
df.with_columns(
    x.alias(n) 
    for x, n in zip(
        func1(pl.col("test")),
        ["test1", "test2"]
        )
    )

4.09 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using **dict and zip():

%%timeit
df.with_columns(
    **dict(
        zip(
            ["test1", "test2"],
            func1(pl.col("test"))
            )
        )
    )

4.08 s ± 40.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using struct():

%%timeit
df.with_columns(
    func2(pl.col("test"), ["test1", "test2"]).alias("struct_col")
).unnest("struct_col")

4.07 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using struct and struct.field:

%%timeit
df.with_columns(
    func2(pl.col("test"), ["test1", "test2"]).struct.field("test1"),
    func2(pl.col("test"), ["test1", "test2"]).struct.field("test2"),
)

4.52 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Lazy execution:

All queries finished in between 4.08 s and 4.1 s. The differences between them are negligible.

Answered 2025-04-14 by Python大师


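Your pl.struct approach is fine. The problem is that the struct you create has two fields with the same name. You can fix this by giving one of the fields a different name, like this: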

def _func(x: IntoExpr):
    x1 = x+1
    x2 = x+2
    return pl.struct([x1, x2.name.prefix("another_")])

(
    df
    .with_columns(
        _func(pl.col("test")).alias("struct_col")
    )
)

This creates a struct column named struct_col with two fields, test and another_test. From there, you can use pl.DataFrame.unnest to split struct_col into two separate columns.

Answered 2025-04-14 by Python大师


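I am assuming you cannot or do not want to change the function, so we need to work with the sequence of expressions it returns. I also want this answer to support more than two columns, so that I do not have to alias each column individually.

The problem you are running into is that you end up with a series of Exprs that all share the same name, so you need to rename them somewhere before execution.

The solution may depend on what names you want to give the columns. You could do this: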

def _func(x: IntoExpr):
    x1 = x+1
    x2 = x+2
    return x1, x2

df.with_columns(
    x.alias(f"test{i+1}") for i,x in enumerate(_func(pl.col("test")))
)

# alternatively
# df.with_columns(
#    x.name.suffix(f"{i+1}") for i,x in enumerate(_func(pl.col("test")))
#)

┌──────┬───────┬───────┐
│ test ┆ test1 ┆ test2 │
│ ---  ┆ ---   ┆ ---   │
│ i32  ┆ i32   ┆ i32   │
╞══════╪═══════╪═══════╡
│ 1    ┆ 2     ┆ 3     │
│ 2    ┆ 3     ┆ 4     │
│ 3    ┆ 4     ┆ 5     │
│ 4    ┆ 5     ┆ 6     │
│ 5    ┆ 6     ┆ 7     │
│ 6    ┆ 7     ┆ 8     │
│ 7    ┆ 8     ┆ 9     │
│ 8    ┆ 9     ┆ 10    │
│ 9    ┆ 10    ┆ 11    │
│ 10   ┆ 11    ┆ 12    │
└──────┴───────┴───────┘

If you want custom names, you can use zip:

df.with_columns(
    x.alias(n) for x, n in zip(_func(pl.col('test')), ["a","b"])
)

┌──────┬─────┬─────┐
│ test ┆ a   ┆ b   │
│ ---  ┆ --- ┆ --- │
│ i32  ┆ i32 ┆ i32 │
╞══════╪═════╪═════╡
│ 1    ┆ 2   ┆ 3   │
│ 2    ┆ 3   ┆ 4   │
│ 3    ┆ 4   ┆ 5   │
│ 4    ┆ 5   ┆ 6   │
│ 5    ┆ 6   ┆ 7   │
│ 6    ┆ 7   ┆ 8   │
│ 7    ┆ 8   ┆ 9   │
│ 8    ┆ 9   ┆ 10  │
│ 9    ┆ 10  ┆ 11  │
│ 10   ┆ 11  ┆ 12  │
└──────┴─────┴─────┘

Alternatively, you can take advantage of the fact that with_columns() accepts **named_exprs: convert the list of expressions into a dictionary and unpack it:

df.with_columns(
    **dict(zip(['a','b'], _func(pl.col('test'))))
)

┌──────┬─────┬─────┐
│ test ┆ a   ┆ b   │
│ ---  ┆ --- ┆ --- │
│ i32  ┆ i32 ┆ i32 │
╞══════╪═════╪═════╡
│ 1    ┆ 2   ┆ 3   │
│ 2    ┆ 3   ┆ 4   │
│ 3    ┆ 4   ┆ 5   │
│ 4    ┆ 5   ┆ 6   │
│ 5    ┆ 6   ┆ 7   │
│ 6    ┆ 7   ┆ 8   │
│ 7    ┆ 8   ┆ 9   │
│ 8    ┆ 9   ┆ 10  │
│ 9    ┆ 10  ┆ 11  │
│ 10   ┆ 11  ┆ 12  │
└──────┴─────┴─────┘

Answered 2025-04-14 by Python大师

