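Row-wise aggregation of a pandas DataFrame

2 votes

3 answers

40 views

Asked on 2025-04-14 17:02

What is the best way to write a Python function that aggregates selected columns of a pandas DataFrame row by row (sum, min, max, mean, and so on), where the column names are given as a list and NaN (missing) values are skipped?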

import pandas as pd
import numpy as np

df = pd.DataFrame({"col1": [1, np.nan, 1],
                   "col2": [2, 2, np.nan]})

def aggregate_rows(df, column_list, func):
    # Check if the specified columns exist in the DataFrame
    missing_columns = [col for col in column_list if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Columns not found in DataFrame: {missing_columns}")

    # Check if func is callable
    if not callable(func):
        raise ValueError("The provided function is not callable.")

    # Apply func to each row after dropping that row's NaN values
    agg_series = df[column_list].apply(lambda row: func(row.dropna()), axis=1)

    return agg_series

df["sum"] = aggregate_rows(df, ["col1", "col2"], sum)
df["max"] = aggregate_rows(df, ["col1", "col2"], max)
df["mean"] = aggregate_rows(df, ["col1", "col2"], lambda x: x.mean())
print(df)

This gives the expected result:

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0

However, if a row contains only NaN values,

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})

the call with max fails:

ValueError: max() arg is an empty sequence

What is a good way to handle this?


3 Answers

If you want to ignore rows that contain only NaN values, drop them with dropna before aggregating:

cols = ['col1', 'col2']
agg = ['sum', 'max', 'mean']

df[agg] = df[cols].dropna(how='all').agg(agg, axis=1)

If your data may contain duplicate index values, a more robust approach is boolean indexing:

cols = ['col1', 'col2']
agg = ['sum', 'max', 'mean']

m = df[cols].notna().any(axis=1)

df.loc[m, agg] = df.loc[m, cols].agg(agg, axis=1)

Note: you can also give the output columns custom names, for example with df.loc[m, ['A', 'B', 'C']] = ...

Output:

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  NaN  NaN   NaN
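The same output can also be reached without a mask by calling the built-in pandas reductions directly. The following is a minimal sketch, not part of the approach above; it assumes only sum, max, and mean are needed and relies on the `min_count` parameter of `DataFrame.sum`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})
cols = ["col1", "col2"]

# skipna=True is the default for these reductions, so NaN values are ignored.
# min_count=1 makes sum() return NaN instead of 0 for a row with no valid values.
result = df.assign(
    sum=df[cols].sum(axis=1, min_count=1),
    max=df[cols].max(axis=1),
    mean=df[cols].mean(axis=1),
)
print(result)
```

Because nothing is assigned through a row mask here, duplicate index values should not matter for this variant.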

Answered on 2025-04-14 by Python大师


You can try numpy.sum, numpy.max, and numpy.mean instead of Python's built-in functions:

df["sum"] = aggregate_rows(df, ["col1", "col2"], np.sum)
df["max"] = aggregate_rows(df, ["col1", "col2"], np.max)
df["mean"] = aggregate_rows(df, ["col1", "col2"], np.mean)

print(df)

Output:

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  0.0  NaN   NaN
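Note that the all-NaN row sums to 0.0 here, because the sum of an empty Series is 0. If NaN is preferred for every aggregation, one option is to guard against empty input before calling the function. The sketch below is only an illustration; `nan_safe` is a hypothetical helper, and `aggregate_rows` is a shortened version of the function from the question:

```python
import numpy as np
import pandas as pd

def nan_safe(func):
    """Hypothetical helper: return NaN when func would receive no values."""
    def wrapper(values):
        return func(values) if len(values) > 0 else np.nan
    return wrapper

def aggregate_rows(df, column_list, func):
    # Drop NaN values in each row, then apply func to what is left.
    return df[column_list].apply(lambda row: func(row.dropna()), axis=1)

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})
cols = ["col1", "col2"]

# Python's built-in sum and max now work, because empty rows never reach them.
df["sum"] = aggregate_rows(df, cols, nan_safe(sum))
df["max"] = aggregate_rows(df, cols, nan_safe(max))
print(df)
```

With this guard the built-in max no longer raises on the all-NaN row, and sum yields NaN there instead of 0.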

Answered on 2025-04-14 by Python大师


You can use df.agg with axis=1 and attach the result to the original DataFrame with df.join:

out = df.join(df.agg(['sum', 'max', 'mean'], axis=1))

out

   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  0.0  NaN   NaN
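The call above aggregates every column of df. Since the question asks for a function that takes a list of column names, the same idea might be wrapped as follows. This is a sketch under assumptions: `join_row_aggregates` is a hypothetical name, and the extra `other` column exists only to show that it is left out of the aggregation:

```python
import numpy as np
import pandas as pd

def join_row_aggregates(df, column_list, funcs):
    """Hypothetical helper: aggregate column_list row by row and join the result to df."""
    missing = [col for col in column_list if col not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in DataFrame: {missing}")
    # Restrict the aggregation to the requested columns; NaN values are skipped.
    aggregated = df[column_list].agg(funcs, axis=1)
    # join raises an error if df already has columns named like the aggregations.
    return df.join(aggregated)

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan],
                   "other": [10, 20, 30, 40]})
out = join_row_aggregates(df, ["col1", "col2"], ["sum", "max", "mean"])
print(out)
```

As in the output above, the sum of an all-NaN row is 0.0 with this approach, while max and mean are NaN.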

Answered on 2025-04-14 by Python大师

