Groupby nlargest based on one column, conditionally based on a second column

Posted 2024-05-13 00:05:19


I have the following dataset:

    individual  cluster  choice  benchmark_probabilities
0      9710535        0       0                 0.008647
1      9710535        2       0                 0.012558
2      9710535        2       0                 0.013894
3      9710535        1       0                 0.030648
4      9710535        1       0                 0.020298
5      9710535        1       0                 0.021444
6      9710535        1       0                 0.014804
7      9710535        5       0                 0.163837
8      9710535        5       0                 0.085191
9      9710535        2       0                 0.013272
10     9710535        2       0                 0.014684
11     9710535        2       0                 0.006987
12     9710535        2       0                 0.007387
13     9710535        2       0                 0.008940
14     9710535        3       0                 0.027746
15     9710535        3       0                 0.017345
16     9710535        3       0                 0.015545
17     9710535        4       0                 0.007449
18     9710535        3       0                 0.013382
19     9710535        4       0                 0.011559
20     9710535        3       0                 0.013091
21     9710535        4       0                 0.006438
22     9710535        4       0                 0.006089
23     9710535        4       0                 0.007768
24     9710535        4       0                 0.007348
25     9710535        2       0                 0.001479
26     9710535        5       0                 0.054764
27     9710535        5       0                 0.065420
28     9710535        5       0                 0.098600
29     9710535        5       0                 0.067577
30     9710535        6       0                 0.002158
31     9710535        6       0                 0.002041
32     9710535        6       0                 0.001694
33     9710535        6       0                 0.001602
34     9710535        7       0                 0.010075
35     9710535        7       0                 0.008076
36     9710535        7       0                 0.004485
37     9710535        7       0                 0.009090
38     9710535        7       0                 0.005834
39     9710535        5       0                 0.018973
40     9710535        7       0                 0.014945
41     9710535        7       0                 0.007159
42     9710535        6       0                 0.001624
43     9710535        6       0                 0.001535
44     9710535        5       0                 0.048068
45     9710535        7       0                 0.003548
46     9710540        0       1                 0.018614
47     9710540        0       0                 0.006515
48     9710540        0       0                 0.004040
49     9710540        1       0                 0.005489

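What I want to do is:

  1. Group by individual and cluster, then select the top 1 row in each group based on benchmark_probabilities
  2. Select the top 5 of those results per individual
  3. If an individual has fewer than 5 unique clusters, fill the remaining slots based on benchmark_probabilities, regardless of cluster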

The result should look like this:

    individual  cluster  choice  benchmark_probabilities
0     9710535        1       0                 0.030648
1     9710535        5       0                 0.163837
2     9710535        3       0                 0.027746
3     9710535        8       0                 0.015682
4     9710535       11       1                 0.050787
5     9710540        0       0                 0.004040
6     9710540        1       0                 0.005489
7     9710540        0       0                 0.006515
8     9710540        0       1                 0.018614

I tried the following, which gives me steps 1 and 2 but not step 3:

data.groupby(["individual", "cluster"])["benchmark_probabilities"].nlargest(1).groupby("individual").nlargest(5)

But the result is not what I want, and it also looks ugly:

individual  individual  cluster     
9710535     9710535     5        7      0.163837
                        11       75     0.050787
                        1        3      0.030648
                        3        14     0.027746
                        8        49     0.015682
9710540     9710540     0        98     0.018614
                        1        101    0.005489

Any help would be greatly appreciated.


1 answer
User
Reply #1 · Posted 2024-05-13 00:05:19

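I think you need sort_values with GroupBy.head instead of nlargest, because that avoids the columns that nlargest drops, and performance is also better. For comparison, here is the nlargest version first, then the sort_values/head version: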

df0 = (data.groupby(["individual", "cluster"])["benchmark_probabilities"].nlargest(1)
           .groupby("individual").nlargest(5))
print (df0)
individual  individual  cluster    
9710535     9710535     5        7     0.163837
                        1        3     0.030648
                        3        14    0.027746
                        7        40    0.014945
                        2        10    0.014684
9710540     9710540     0        46    0.018614
                        1        49    0.005489
Name: benchmark_probabilities, dtype: float64

df1 = (data.sort_values(['individual','cluster','benchmark_probabilities'],
                         ascending=[True, True, False])
           .groupby(["individual", "cluster"]).head(1)
           .sort_values(['individual','benchmark_probabilities'], 
                        ascending=[True, False])
           .groupby("individual").head(5))
print (df1)
    individual  cluster  choice  benchmark_probabilities
7      9710535        5       0                 0.163837
3      9710535        1       0                 0.030648
14     9710535        3       0                 0.027746
40     9710535        7       0                 0.014945
10     9710535        2       0                 0.014684
46     9710540        0       1                 0.018614
49     9710540        1       0                 0.005489
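
To make the "lost columns" point concrete, here is a minimal sketch on made-up data (the small `data` frame below is an assumption for illustration, not the dataset from the question). It compares what each approach returns for step 1 only:

```python
import pandas as pd

# Made-up toy data with the same column names as in the question.
data = pd.DataFrame({
    "individual": [1, 1, 1, 2, 2],
    "cluster": [0, 0, 1, 0, 0],
    "choice": [0, 0, 0, 1, 0],
    "benchmark_probabilities": [0.2, 0.7, 0.4, 0.1, 0.3],
})

# nlargest on a single selected column returns a Series:
# the other columns (for example "choice") are no longer available.
series_result = (
    data.groupby(["individual", "cluster"])["benchmark_probabilities"]
        .nlargest(1)
)

# sort_values + head keeps whole rows, so every column survives
# and the original row labels stay as a plain index.
frame_result = (
    data.sort_values(["individual", "cluster", "benchmark_probabilities"],
                     ascending=[True, True, False])
        .groupby(["individual", "cluster"])
        .head(1)
)

print(type(series_result).__name__)
print(frame_result)
```

Both pick the same rows; the difference is only in the shape of what comes back, which is why the later steps are easier to write on top of the head version.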

Then keep only the rows of the original data that are not in df1, and sort them:

df2 = (data[~data.index.isin(df1.index)]
           .sort_values(['individual','benchmark_probabilities'], 
                        ascending=[True, False])
           )
#print (df2)

Append them to df1 and take the top 5 rows per individual with head:

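import pandas as pd

# df1 rows come first, so head(5) only takes fillers from df2 when needed;
# a stable sort keeps that order within each individual
df = (pd.concat([df1, df2])
        .groupby('individual').head(5)
        .sort_values('individual', kind='mergesort'))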
print (df)
    individual  cluster  choice  benchmark_probabilities
7      9710535        5       0                 0.163837
3      9710535        1       0                 0.030648
14     9710535        3       0                 0.027746
40     9710535        7       0                 0.014945
10     9710535        2       0                 0.014684
46     9710540        0       1                 0.018614
49     9710540        1       0                 0.005489
47     9710540        0       0                 0.006515
48     9710540        0       0                 0.004040
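
The steps above can also be put together in one function. This is only a sketch under a few assumptions: the function name `top_n_with_fill` and the toy frame are made up for illustration, the index of the input is assumed to be unique (the filter relies on `index.isin`), and a single sort by probability is used in place of the two sorts shown above.

```python
import pandas as pd


def top_n_with_fill(data: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Best row per (individual, cluster), top n per individual, then fill.

    Hypothetical helper; assumes data has a unique index.
    """
    # Highest probability first within each individual.
    ranked = data.sort_values(
        ["individual", "benchmark_probabilities"],
        ascending=[True, False],
    )
    # Steps 1 and 2: best row of each cluster, then the n best of those.
    best = ranked.groupby(["individual", "cluster"]).head(1)
    best = best.groupby("individual").head(n)
    # Step 3: all other rows, still ordered by probability, serve as fillers.
    rest = ranked[~ranked.index.isin(best.index)]
    combined = pd.concat([best, rest]).groupby("individual").head(n)
    # Stable sort so each individual keeps "best rows first, fillers after".
    return combined.sort_values("individual", kind="mergesort")


# Made-up toy data: individual 1 has only two clusters, so one filler is needed.
toy = pd.DataFrame({
    "individual": [1, 1, 1, 1, 2, 2, 2],
    "cluster": [0, 0, 1, 1, 0, 1, 2],
    "choice": [0, 0, 0, 1, 0, 0, 0],
    "benchmark_probabilities": [0.10, 0.40, 0.30, 0.20, 0.05, 0.50, 0.25],
})

result = top_n_with_fill(toy, n=3)
print(result)
```

With `n=3`, individual 1 gets its two per-cluster winners plus the best leftover row, while individual 2 already has three clusters and needs no filler.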
