Remove list elements from a nested Polars column

Asked 2025-04-14 16:02

I would like to achieve the following:

(pl.Series(['abc_remove_def', 'remove_abc_def', 'abc_def_remove']).str.split('_')
   .map_elements(lambda x: [y for y in x if y != 'remove']).list.join('_')
)

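Is there a way to do this without the slow map_elements? I have tried .list.eval with pl.element(), but I cannot find anything that excludes list elements by value (here, the word 'remove').

Tags: list operations, data processing, polars, nested columns

2 Answers

Here are two approaches (see mozway's answer for a performance comparison).

Approach 1: gather (not recommended)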

(pl.select(a=pl.Series(['abc_remove_def', 'remove_abc_def', 'abc_def_remove']).str.split('_'))
 .with_row_index('i')
 .group_by('i', maintain_order=True)  # keep the original row order
 .agg(pl.col('a').list.gather(pl.arg_where(pl.col('a').explode()!="remove")).first())
 .select('a')
)

or

Approach 2: filter

(pl.select(a=pl.Series(['abc_remove_def', 'remove_abc_def', 'abc_def_remove']).str.split('_'))
 .with_row_index('i')
 .explode('a')
 .filter(pl.col('a')!='remove')
 .group_by('i', maintain_order=True)  # keep the original row order
 .agg('a')
 .select('a')
)

On this sample data the first approach seems slightly faster, but it becomes very slow once the data gets large.

Answered 2025-04-14 by Python大师

Combining list.eval with filter, you can do this:

# list_eval
(pl
   .Series(['abc_remove_def', 'remove_abc_def', 'abc_def_remove']).str.split('_')
   .list.eval(pl.element().filter(pl.element() != 'remove'))
)

However, as @jqurious mentioned, list.set_difference is the simplest and fastest approach:

# list_set_difference
(pl
   .Series(['abc_remove_def', 'remove_abc_def', 'abc_def_remove']).str.split('_')
   .list.set_difference(['remove'])
)

Output:

shape: (3,)
Series: '' [list[str]]
[
    ["abc", "def"]
    ["abc", "def"]
    ["abc", "def"]
]

Timings and differences

Lists of 3 items

Lists of 100 items with many duplicates

Lists of 100 items without duplicates

Note: these timings do not include the creation of the Series.

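Also note that list.set_difference removes duplicate values as well, and, as the second row of its output below shows, it may not preserve the original element order.

For example, with: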

s = pl.Series(['abc_remove_abc_def', 'remove_abc_def']).str.split('_')

# output after set_difference
shape: (2,)
Series: '' [list[str]]
[
    ["abc", "def"]
    ["def", "abc"]
]

# output for the other approaches
shape: (2,)
Series: '' [list[str]]
[
    ["abc", "abc", "def"]
    ["abc", "def"]
]