What is a fast and elegant way to transform a column that contains duplicate values?

Asked 2025-04-13 17:01

Suppose we have a pandas dataframe named df with a column 'A', and a fairly complex transformation function:

def transform_a_to_b(a):
    ...
    return b

To create a column 'B' from 'A' with this function, we can write:

df['B'] = df['A'].apply(lambda x: transform_a_to_b(x))

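If the transformation is slow and column 'A' contains many duplicate values, and the function always maps equal inputs to the same 'B' value, is there a better way to do this? The dataframe also has other columns, so the results need to be mapped back onto every row of the original dataframe.

I came up with the following solution, but I feel there should be a simpler way.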

import pandas as pd

transform_counts = 0
def transform_a_to_b(a):
    global transform_counts
    # Keep count of how many times this was called
    transform_counts += 1

    return 2 * a

# Test dataframe with several duplicates
df = pd.DataFrame({
    'A': [1, 3, 2, 2, 3, 3, 2, 3, 1, 1, 1],
})

# My solution:
# Perform transformation only 3 times for the 3 unique A values and preserve order
df = df.merge(
    df['A'].drop_duplicates().apply(lambda a: pd.Series(
        data=[a, transform_a_to_b(a)],
        index=['A', 'B'],
    )),
    on='A',
    how='left',
)

After running this, transform_counts is 3, and df has a column B equal to 2 * A in every row.

I am not opposed to caching if that makes things simpler, but I cannot change the original definition of the transformation.

1 Answer

Your approach is fine. I would suggest replacing merge plus drop_duplicates with map plus unique:

df['B'] = df['A'].map({k: transform_a_to_b(k) for k in df['A'].unique()})
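To check that the map plus unique approach really calls the function once per distinct value, here is a minimal self-contained sketch. The helper name map_unique, the counter call_count, and the extra column 'other' are hypothetical and only there for illustration; transform_a_to_b is the cheap stand-in from the question, not the real expensive function.

```python
import pandas as pd

call_count = 0  # hypothetical counter, only to show how often the function runs


def transform_a_to_b(a):
    """Stand-in for the expensive transformation from the question."""
    global call_count
    call_count += 1
    return 2 * a


def map_unique(series, func):
    """Hypothetical helper: call func once per distinct value, then broadcast."""
    lookup = {value: func(value) for value in series.unique()}
    return series.map(lookup)


df = pd.DataFrame({
    'A': [1, 3, 2, 2, 3, 3, 2, 3, 1, 1, 1],
    'other': list('abcdefghijk'),  # extra column to show rows stay aligned
})
df['B'] = map_unique(df['A'], transform_a_to_b)

print(call_count)        # number of real calls to transform_a_to_b
print(df['B'].tolist())  # B values in the original row order
```

Because map assigns by value rather than by joining tables, the other columns and the row order of df are left untouched.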

A more Pythonic alternative is to add cache to your function (functools.cache requires Python 3.9 or later):

from functools import cache

import pandas as pd

transform_counts = 0

@cache
def transform_a_to_b(a):
    global transform_counts
    # Keep count of how many times this was called
    transform_counts += 1
    return 2 * a

df = pd.DataFrame({
    'A': [1, 3, 2, 2, 3, 3, 2, 3, 1, 1, 1],
})

df['B'] = df['A'].map(transform_a_to_b)

print(df)

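Output: df now has a column B equal to 2 * A in every row, and transform_counts is 3, one call per unique value of A.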
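Since the question says the original definition cannot be changed, one possible variation is to wrap the existing function with cache instead of decorating it. This is a sketch under that assumption; the names cached_transform and call_count are hypothetical, and the values in 'A' must be hashable for caching to work.

```python
from functools import cache  # Python 3.9+; lru_cache(maxsize=None) is the older equivalent

import pandas as pd

call_count = 0  # hypothetical counter, only to show how often the function runs


def transform_a_to_b(a):
    """Stand-in for the original transformation, left undecorated."""
    global call_count
    call_count += 1
    return 2 * a


# Wrap rather than decorate: the original definition stays as it is.
cached_transform = cache(transform_a_to_b)

df = pd.DataFrame({'A': [1, 3, 2, 2, 3, 3, 2, 3, 1, 1, 1]})
df['B'] = df['A'].map(cached_transform)

print(call_count)                     # real calls to transform_a_to_b
print(cached_transform.cache_info())  # hit and miss statistics of the wrapper
```

The cache is unbounded and lives as long as cached_transform does, so for a very large number of distinct values the dictionary built from unique may be the lighter choice.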

Answered 2025-04-13 by Python大师
