Python pandas 按组累积和及百分比

5 投票
1 回答
4591 浏览
提问于 2025-04-18 16:50

给定以下的数据表 df:

      app       platform    uuid                              minutes
0     1         0  a696ccf9-22cb-428b-adee-95c9a97a4581       67
1     2         0  8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d        1
2     2         1  40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3     1         0  26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f        2
4     2         0  34271596-eebb-4423-b890-dc3761ed37ca        8
5     3         1  C57D0F52-B565-4322-85D2-C2798F7CA6FF       16
6     2         0      245501ec2e39cb782bab1fb02d7813b7        1
7     3         1  DE6E4714-5A3C-4C80-BD81-EAACB2364DF0       30
8     3         0  f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
9     2         0  9c08c860-7a6d-4810-a5c3-f3af2a3fcf66      470
10    3         1      19fdaedfd0dbdaf6a7a6b49619f11a19        3
11    3         1  AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B       58
12    2         0  4eb1024b-c293-42a4-95a2-31b20c3b524b       24
13    3         1  8E0B0BE3-8553-4F38-9837-6C907E01F84C        7
14    3         1  E8B2849C-F050-4DCD-B311-5D57015466AE      465
15    2         0  ec7fedb6-b118-424a-babe-b8ffad579685      266
16    1         0  7e302dcb-ceaf-406c-a9e5-66933d921064      184
17    2         0      f786528ded200c9f553dd3a5e9e9bb2d       10
18    3         1  1E291633-AF27-4DFB-8DA4-4A5B63F175CF       13
19    2         0  953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408`

我将对它进行分组:

y=df.groupby(['app','platform','uuid']).sum().reset_index().sort(['app','platform','minutes'],ascending=[1,1,0]).set_index(['app','platform','uuid'])

                                                   minutes
app platform uuid                                         
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184
             a696ccf9-22cb-428b-adee-95c9a97a4581       67
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f        2
2   0        953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408
             9c08c860-7a6d-4810-a5c3-f3af2a3fcf66      470
             ec7fedb6-b118-424a-babe-b8ffad579685      266
             4eb1024b-c293-42a4-95a2-31b20c3b524b       24
             f786528ded200c9f553dd3a5e9e9bb2d           10
             34271596-eebb-4423-b890-dc3761ed37ca        8
             245501ec2e39cb782bab1fb02d7813b7            1
             8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d        1
    1        40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3   0        f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
    1        E8B2849C-F050-4DCD-B311-5D57015466AE      465
             AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B       58
             DE6E4714-5A3C-4C80-BD81-EAACB2364DF0       30
             C57D0F52-B565-4322-85D2-C2798F7CA6FF       16
             1E291633-AF27-4DFB-8DA4-4A5B63F175CF       13
             8E0B0BE3-8553-4F38-9837-6C907E01F84C        7
             19fdaedfd0dbdaf6a7a6b49619f11a19            3

这样我就能按 uuid 得到它的分钟数,并且是按降序排列的。

现在,我将对每个应用/平台/uuid 的累计分钟数进行求和:

y.groupby(level=[0,1]).cumsum()


app platform uuid                                         
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184
             a696ccf9-22cb-428b-adee-95c9a97a4581      251
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f      253
2   0        953a525c-97e0-4c2f-90e0-dfebde3ec20d     2408
             9c08c860-7a6d-4810-a5c3-f3af2a3fcf66     2878
             ec7fedb6-b118-424a-babe-b8ffad579685     3144
             4eb1024b-c293-42a4-95a2-31b20c3b524b     3168
             f786528ded200c9f553dd3a5e9e9bb2d         3178
             34271596-eebb-4423-b890-dc3761ed37ca     3186
             245501ec2e39cb782bab1fb02d7813b7         3187
             8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d     3188
    1        40AD6CD1-4A7B-48DD-8815-1829C093A95C       13
3   0        f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e       10
    1        E8B2849C-F050-4DCD-B311-5D57015466AE      465
             AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B      523
             DE6E4714-5A3C-4C80-BD81-EAACB2364DF0      553
             C57D0F52-B565-4322-85D2-C2798F7CA6FF      569
             1E291633-AF27-4DFB-8DA4-4A5B63F175CF      582
             8E0B0BE3-8553-4F38-9837-6C907E01F84C      589
             19fdaedfd0dbdaf6a7a6b49619f11a19          592

我的问题是:我该如何计算每个组的累计总和的百分比,也就是说,像这样:

app platform uuid                                         
1   0        7e302dcb-ceaf-406c-a9e5-66933d921064      184    0.26
             a696ccf9-22cb-428b-adee-95c9a97a4581      251    0.36
             26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f      253    0.36
...
...
...

1 个回答

3

你提到的0.26和0.36这两个数字是怎么来的不太清楚,不过假设它们只是占位符数字,如果你想要计算每个组的总百分比,可以这样做:

y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])

这样应该能得到正确的结果。

In [398]: y['running_pct'].head()
Out[398]: 
app  platform  uuid                                
1    0         7e302dcb-ceaf-406c-a9e5-66933d921064    0.727273
               a696ccf9-22cb-428b-adee-95c9a97a4581    0.992095
               26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f    1.000000
2    0         953a525c-97e0-4c2f-90e0-dfebde3ec20d    0.755332
               9c08c860-7a6d-4810-a5c3-f3af2a3fcf66    0.902760
Name: running_pct, dtype: float64

编辑:

根据评论,如果你想要提高一点性能,从0.14.1版本开始,这个方法会更快。

y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')

而且正如@Jeff提到的,在0.15.0版本中,这个方法可能会更快。

y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')

撰写回答