Grouping rows with preceding 0s in Pandas

Asked 2025-04-14 18:23

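I need to group each run of rows that have 0 in Column A together with the nonzero row that follows them, and sum the Column B values over that group. Rows that are not preceded by 0s should appear in the output as they are. How can I do this with a pandas DataFrame? The attached screenshot shows an example grouping and the expected output. The run of preceding 0 rows can be any length: even 20 preceding rows should be grouped.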

[Screenshot: example grouping and expected output]

Reproducible input:

import pandas as pd

df = pd.DataFrame({'Column A': [0,0,1,1,1,0,-1,-1,0,0,0,1,0,0,0,0,-1],
                   'Column B': [5,3,4,2,1,3,7,5,10,2,3,4,5,5,5,5,5]
                  })

1 Answer

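If I understand correctly, you can build the group key with a reversed cumsum, compute groupby.transform('sum'), and then use mask to blank out the rows that should not carry a result: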

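# Group key: count the nonzero rows from the bottom up, so each run of
# zeros shares a key with the nonzero row that follows it.
group = df.loc[::-1, 'Column A'].ne(0).cumsum()[::-1]
# True for every row except the last row of its group.
m = group.duplicated(keep='last')

# Broadcast the group sums to every row, then blank out the masked rows.
out = df.join(df.groupby(group, as_index=False).transform('sum')
                .add_suffix(' output').mask(m))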

Output:

    Column A  Column B  Column A output  Column B output
0          0         5              NaN              NaN
1          0         3              NaN              NaN
2          1         4              1.0             12.0
3          1         2              1.0              2.0
4          1         1              1.0              1.0
5          0         3              NaN              NaN
6         -1         7             -1.0             10.0
7         -1         5             -1.0              5.0
8          0        10              NaN              NaN
9          0         2              NaN              NaN
10         0         3              NaN              NaN
11         1         4              1.0             19.0
12         0         5              NaN              NaN
13         0         5              NaN              NaN
14         0         5              NaN              NaN
15         0         5              NaN              NaN
16        -1         5             -1.0             25.0
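
To see why the reversed cumsum produces the right groups, it can help to print the intermediate values next to the input. The following is only an illustrative sketch on the sample data from the question. The `steps` frame and its `group` and `masked` column names are made up here for display and are not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({'Column A': [0,0,1,1,1,0,-1,-1,0,0,0,1,0,0,0,0,-1],
                   'Column B': [5,3,4,2,1,3,7,5,10,2,3,4,5,5,5,5,5]})

# Walk the column bottom-up, flag the nonzero rows, and count them
# cumulatively; the final [::-1] restores the original row order.
# Seen from the bottom, every nonzero row opens a new group, so the
# zeros directly above it inherit the same key.
group = df.loc[::-1, 'Column A'].ne(0).cumsum()[::-1]

# True for every row that is not the last row of its group,
# i.e. the leading zeros that should show no result.
m = group.duplicated(keep='last')

# Hypothetical helper frame, only for inspecting the steps side by side.
steps = df.assign(group=group, masked=m)
print(steps)
```

The keys decrease from top to bottom (7 for the first three rows down to 1 for the last five) because the count starts at the bottom. The key values themselves do not matter, only which rows share one.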

Alternatively, if you only want the sum of B as output:

group = df.loc[::-1, 'Column A'].ne(0).cumsum()[::-1]
m = group.duplicated(keep='last')

df.loc[~m, 'out'] = df.groupby(group)['Column B'].transform('sum')

Or, a slightly more efficient variant:

group = df.loc[::-1, 'Column A'].ne(0).cumsum()[::-1]
m = group.duplicated(keep='last')

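# sort=False keeps the groups in order of first appearance, which matches
# the top-to-bottom order of the rows selected by ~m.
df.loc[~m, 'out'] = df.groupby(group, sort=False)['Column B'].sum().to_numpy()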

Output:

    Column A  Column B   out
0          0         5   NaN
1          0         3   NaN
2          1         4  12.0
3          1         2   2.0
4          1         1   1.0
5          0         3   NaN
6         -1         7  10.0
7         -1         5   5.0
8          0        10   NaN
9          0         2   NaN
10         0         3   NaN
11         1         4  19.0
12         0         5   NaN
13         0         5   NaN
14         0         5   NaN
15         0         5   NaN
16        -1         5  25.0
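
As a sanity check, the vectorized results can be compared against a plain loop that accumulates Column B until it reaches a nonzero Column A value. This is a sketch under my reading of the requirement, not a definitive implementation. The helper name `leading_zero_sums` is hypothetical, and the loop is only meant as a reference, not as a replacement for the groupby approach.

```python
import pandas as pd

df = pd.DataFrame({'Column A': [0,0,1,1,1,0,-1,-1,0,0,0,1,0,0,0,0,-1],
                   'Column B': [5,3,4,2,1,3,7,5,10,2,3,4,5,5,5,5,5]})

def leading_zero_sums(a_values, b_values):
    """Hypothetical reference: running sum of B, emitted at each nonzero A."""
    out, running = [], 0
    for a, b in zip(a_values, b_values):
        running += b
        if a != 0:
            out.append(float(running))   # close the group on the nonzero row
            running = 0
        else:
            out.append(float('nan'))     # zero rows carry no result
    return out

expected = pd.Series(leading_zero_sums(df['Column A'], df['Column B']),
                     index=df.index, name='out')

group = df.loc[::-1, 'Column A'].ne(0).cumsum()[::-1]
m = group.duplicated(keep='last')

# transform-based variant: aligned on the index, then masked.
result = df.groupby(group)['Column B'].transform('sum').mask(m).rename('out')

# Positional variant: relies on sort=False so that the group order
# matches the row order of ~m.
fast = pd.Series(float('nan'), index=df.index, name='out')
fast.loc[~m] = df.groupby(group, sort=False)['Column B'].sum().to_numpy()

pd.testing.assert_series_equal(result, expected)
pd.testing.assert_series_equal(fast, expected)
```

Both variants agree with the loop on the sample input. Two caveats are worth keeping in mind. With the default `sort=True`, the positional variant would assign the sums in reverse order, because the group keys decrease from top to bottom. And if Column A ended in zeros with no nonzero row after them, the loop would leave those rows as NaN, whereas the cumsum-based grouping would, as far as I can tell, place their sum on the final zero row.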
