Summarizing rows in a Pandas DataFrame

2 votes
2 answers
637 views
Asked 2025-04-18 00:06

I have the following rows of data:

    ColumnID  MenuID  QuestionID  ResponseCount       RowID  SourceColumnID  SourceRowID  SourceVariationID
22        -2      -2   319276487             28  3049400354      3049400356   3049400365         3049400365
23        -2      -2   319276487             31  3049400354      3049400356   3049400365         3049400365
24        -2      -2   319276487             37  3049400354      3049400356   3049400365         3049400365
25        -2      -2   319276487             28  3049400353      3049400357   3049400365         3049400365
26        -2      -2   319276487             45  3049400353      3049400357   3049400365         3049400365
27        -2      -2   319276487             46  3049400353      3049400357   3049400365         3049400365
28        -2      -2   319276487             26  3049400353      3049400358   3049400365         3049400365
29        -2      -2   319276487             33  3049400353      3049400358   3049400365         3049400365
30        -2      -2   319276487             39  3049400353      3049400358   3049400365         3049400365
31        -2      -2   319276487             26  3049400353      3049400359   3049400365         3049400365

I want to condense this table, that is, compute the total ResponseCount for each combination of RowID and SourceVariationID.

For example:

    ColumnID  MenuID  QuestionID  ResponseCount       RowID  SourceColumnID  SourceRowID  SourceVariationID
22        -2      -2   319276487             96  3049400354      3049400356   3049400365         3049400365
23        -2      -2   319276487            243  3049400353      3049400356   3049400365

Here is what I have come up with so far:

(Pdb) new_df = df.groupby(['RowID', 'SourceVariationID', 'SourceRowID']).sum()                                                                          
(Pdb) new_df['ColumnID'] = -2
(Pdb) new_df['MenuID'] = -2
(Pdb) pp new_df
                                          ColumnID  MenuID  QuestionID  ResponseCount  SourceColumnID
RowID      SourceVariationID SourceRowID                                                             
3031434948 3031434943        3031434943         -2      -2  3805083612            141     36377219262
           3031434945        3031434945         -2      -2  4439264214            237     42440089136

[2 rows x 5 columns]

2 Answers

0

Assuming your other columns are all integers:

columns = df.columns.tolist()
columns.remove('ResponseCount')
columns.remove('RowID')
tempDf = df.groupby(['RowID'])[['ResponseCount']].sum()
tempDf = tempDf.join(df.groupby(['RowID'])[columns].min())
tempDf['RowID'] = tempDf.index

This is a quick solution, but not necessarily the best one. Hope it helps.
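
The two `groupby` calls and the `join` above can probably be collapsed into a single `agg` call that maps each column to its own aggregation. A minimal sketch, assuming a small made-up frame with only a subset of the question's columns (the sample data and the names `other_cols`, `agg_spec`, and `summary` are illustrative, not from the original post):

```python
import pandas as pd

# Illustrative sample with a few of the columns from the question.
df = pd.DataFrame({
    "ColumnID": [-2, -2, -2, -2],
    "QuestionID": [319276487] * 4,
    "ResponseCount": [28, 31, 37, 28],
    "RowID": [3049400354, 3049400354, 3049400354, 3049400353],
    "SourceVariationID": [3049400365] * 4,
})

# Take the minimum of every non-key column except ResponseCount, which is summed.
other_cols = [c for c in df.columns if c not in ("RowID", "ResponseCount")]
agg_spec = {col: "min" for col in other_cols}
agg_spec["ResponseCount"] = "sum"

# as_index=False keeps RowID as an ordinary column instead of the index.
summary = df.groupby("RowID", as_index=False).agg(agg_spec)
print(summary)
```

As in the answer above, `min` is only a reasonable stand-in when the other columns are numeric and effectively constant within each RowID.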

2

You can do it like this:

print(df)
   ColumnID  MenuID  QuestionID  ResponseCount       RowID  SourceVariationID
0        -2      -2   319276487             28  3049400354         3049400365
1        -2      -2   319276487             31  3049400354         3049400365
2        -2      -2   319276487             37  3049400354         3049400365
3        -2      -2   319276487             28  3049400353         3049400365
4        -2      -2   319276487             45  3049400353         3049400365
5        -2      -2   319276487             46  3049400353         3049400365
6        -2      -2   319276487             26  3049400353         3049400365
7        -2      -2   319276487             33  3049400353         3049400365
8        -2      -2   319276487             39  3049400353         3049400365
9        -2      -2   319276487             26  3049400353         3049400365


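def squash(group):
    # Use the first row of the group as a template (iloc[0] also works for single-row groups)
    x = group.iloc[0, :].drop(['RowID', 'SourceVariationID'])
    # Replace ResponseCount with the total for the whole group
    x['ResponseCount'] = group['ResponseCount'].sum()
    return x

print(df.groupby(['RowID', 'SourceVariationID']).apply(squash))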

                             ColumnID  MenuID  QuestionID  ResponseCount
RowID      SourceVariationID                                             
3049400353 3049400365               -2      -2   319276487            243
3049400354 3049400365               -2      -2   319276487             96
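
If the other columns are constant within each group, the same result can likely be obtained without `apply`, using named aggregation with `first`. A sketch, assuming pandas 0.25 or later (where named aggregation is available) and the ten rows printed above; the trailing `reset_index()` is an addition here to get a flat table like the one requested in the question:

```python
import pandas as pd

# The ten rows shown in this answer.
df = pd.DataFrame({
    "ColumnID": [-2] * 10,
    "MenuID": [-2] * 10,
    "QuestionID": [319276487] * 10,
    "ResponseCount": [28, 31, 37, 28, 45, 46, 26, 33, 39, 26],
    "RowID": [3049400354] * 3 + [3049400353] * 7,
    "SourceVariationID": [3049400365] * 10,
})

result = (
    df.groupby(["RowID", "SourceVariationID"])
      .agg(
          # Keep the first value of columns assumed constant per group.
          ColumnID=("ColumnID", "first"),
          MenuID=("MenuID", "first"),
          QuestionID=("QuestionID", "first"),
          # Sum only the column that should actually be totalled.
          ResponseCount=("ResponseCount", "sum"),
      )
      .reset_index()  # turn the group keys back into regular columns
)
print(result)
```

Unlike calling `.sum()` on the whole group, this leaves ID-like columns such as QuestionID untouched rather than adding them together.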
