Add an index based on two columns + sorted column values + a condition

    individual  cluster  totalPrice  totalTripDurationMinutes
0      9710556        1      180.82                       140
1      9710556        0      202.32                       145
2      9710556        0      180.82                       140
3      9710535        7      729.44                       460
4      9710535        7      729.44                       640
5      9710535        7      702.60                       355
6      9710535        7      685.82                       300
7      9710535        7      685.82                       480
8      9710535        7      669.44                       520
9      9710535        7      669.44                       580
10     9710535        7      669.44                       700

    individual  dominationCount  cluster  totalPrice  totalTripDurationMinutes
0      9710556                0        1      180.82                       140
1      9710556                0        0      202.32                       145
2      9710556                1        0      180.82                       140
3      9710535                0        7      729.44                       460
4      9710535                0        7      729.44                       640
5      9710535                1        7      702.60                       355
6      9710535                2        7      685.82                       300
7      9710535                2        7      685.82                       480
8      9710535                3        7      669.44                       520
9      9710535                3        7      669.44                       580
10     9710535                3        7      669.44                       700
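
To try the answers below, the sample data can be rebuilt as a DataFrame. This is only a minimal sketch: the values are copied from the question's sample data, and the variable name `df` is an assumption chosen to match the name the answers use.

```python
import pandas as pd

# Sample data from the question: 11 rows, 4 columns
df = pd.DataFrame({
    'individual': [9710556] * 3 + [9710535] * 8,
    'cluster': [1, 0, 0] + [7] * 8,
    'totalPrice': [180.82, 202.32, 180.82, 729.44, 729.44, 702.60,
                   685.82, 685.82, 669.44, 669.44, 669.44],
    'totalTripDurationMinutes': [140, 145, 140, 460, 640, 355,
                                 300, 480, 520, 580, 700],
})
```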

3 Answers

User

#1 · edited 2024-05-20 01:07:41

You can define a function named check_price:

import numpy as np

def check_price(x):
    # sort the group's prices in descending order and keep only the unique values
    prices = x.sort_values(ascending=False).unique()
    # the position of each price in the sorted unique prices is its dominated count
    dominate = [np.where(prices == val)[0][0] for val in x]
    return dominate

Then use groupby and transform:

df['dominatedCount'] = df.groupby(['individual', 'cluster'])['totalPrice'].transform(check_price)
df

    individual  cluster  totalPrice  totalTripDurationMinutes  dominatedCount
0      9710556        1      180.82                       140             0.0
1      9710556        0      202.32                       145             0.0
2      9710556        0      180.82                       140             1.0
3      9710535        7      729.44                       460             0.0
4      9710535        7      729.44                       640             0.0
5      9710535        7      702.60                       355             1.0
6      9710535        7      685.82                       300             2.0
7      9710535        7      685.82                       480             2.0
8      9710535        7      669.44                       520             3.0
9      9710535        7      669.44                       580             3.0
10     9710535        7      669.44                       700             3.0

User

#2 · edited 2024-05-20 01:07:41

Use GroupBy.rank with method='dense' and subtract 1:

df['dominatedCount'] = (df.groupby(['individual', 'cluster'])['totalPrice']
                          .rank(ascending=False, method='dense')
                          .astype(int)
                          .sub(1))
print(df)
    individual  cluster  totalPrice  totalTripDurationMinutes  dominatedCount
0      9710556        1      180.82                       140               0
1      9710556        0      202.32                       145               0
2      9710556        0      180.82                       140               1
3      9710535        7      729.44                       460               0
4      9710535        7      729.44                       640               0
5      9710535        7      702.60                       355               1
6      9710535        7      685.82                       300               2
7      9710535        7      685.82                       480               2
8      9710535        7      669.44                       520               3
9      9710535        7      669.44                       580               3
10     9710535        7      669.44                       700               3
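
To see why subtracting 1 from a dense rank gives the count, it can help to look at one group in isolation. The following is a minimal sketch, assuming the prices of the 9710535 / cluster 7 group from the question; the names `prices`, `dense` and `dominated` are hypothetical and used only for illustration.

```python
import pandas as pd

# Prices of the (individual=9710535, cluster=7) group from the question
prices = pd.Series([729.44, 729.44, 702.60, 685.82,
                    685.82, 669.44, 669.44, 669.44])

# Dense ranking gives tied prices the same rank and leaves no gaps,
# so the highest price gets rank 1, the next distinct price rank 2, and so on
dense = prices.rank(ascending=False, method='dense').astype(int)

# rank - 1 is therefore the number of distinct higher prices in the group
dominated = dense.sub(1)
print(dominated.tolist())  # [0, 0, 1, 2, 2, 3, 3, 3]
```

With method='min' the ranks for the same prices would be 1, 1, 3, 4, 4, 6, 6, 6, which counts higher-priced rows rather than distinct higher prices, so 'dense' is the variant that matches the expected output.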

User

#3 · edited 2024-05-20 01:07:41

Here is a rather convoluted approach:

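# Pair every row with each distinct price in its (individual, cluster) group,
# keep only the pairs where the other price is higher, and count the
# distinct higher prices for each price.
higher = (df.merge(df[['individual', 'cluster', 'totalPrice']].drop_duplicates(),
                   on=['individual', 'cluster'],
                   suffixes=('', '_new'),
                   how='left')
            .query('totalPrice<totalPrice_new')
            .drop('totalTripDurationMinutes', axis=1)
            .drop_duplicates()
            .groupby(['individual', 'cluster', 'totalPrice'], as_index=False)
            .count()
            .rename(columns={'totalPrice_new': 'dominationCount'}))

# Rows with the highest price in their group have no match, so their count becomes 0.
result = df.merge(higher, how='left',
                  on=['individual', 'cluster', 'totalPrice']).fillna(0)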

The result is:

    individual  cluster  totalPrice  totalTripDurationMinutes  dominationCount
0      9710556        1      180.82                       140              0.0
1      9710556        0      202.32                       145              0.0
2      9710556        0      180.82                       140              1.0
3      9710535        7      729.44                       460              0.0
4      9710535        7      729.44                       640              0.0
5      9710535        7      702.60                       355              1.0
6      9710535        7      685.82                       300              2.0
7      9710535        7      685.82                       480              2.0
8      9710535        7      669.44                       520              3.0
9      9710535        7      669.44                       580              3.0
10     9710535        7      669.44                       700              3.0
