How can we write a function to get the row numbers of duplicate values and min(row number)?

Posted 2024-06-08 18:07:14

    name   job      id_number
0   krul   painter  125796
1   tim    lawyer   789632
2   daisy  engg     256498
3   alex   dancer   456985
4   mandy  arch     456258
5   krul   painter  125796
6   tim    lawyer   789632
7   tim    lawyer   789632
8   tim    lawyer   789632
9   daisy  engg     256498
10  daisy  engg     256498

Expected output:

 dup_Index   min_index
    0            0
    5            0
    2            2
    9            2
   10            2
    6            6
    7            7
    8            8

2 Answers

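Although I can't tell the intent behind the grouping from this question, if you want to see the unique records and the (duplicate) indices at which each one occurs, you can always fall back on groupby: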

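# Pass the keys as a list: recent pandas versions treat a tuple as a single key and raise KeyError
df.groupby(['name', 'job', 'id_number'], as_index=True).apply(lambda x: x.index.tolist())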

Output:

name   job      id_number
alex   dancer   456985                [3]
daisy  engg     256498         [2, 9, 10]
krul   painter  125796             [0, 5]
mandy  arch     456258                [4]
tim    lawyer   789632       [1, 6, 7, 8]
dtype: object

You can then run further queries on this result, for example to get the length of each list and its first element.
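As a minimal sketch of such follow-up queries: the DataFrame below is rebuilt from the question's sample data, and the names `index_lists`, `summary`, and `duplicates_only` are illustrative, not from the original answer. It selects a single column before `apply` so that the lambda only sees that column; this is a small variation on the call above, not the answerer's exact code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["krul", "tim", "daisy", "alex", "mandy", "krul",
             "tim", "tim", "tim", "daisy", "daisy"],
    "job": ["painter", "lawyer", "engg", "dancer", "arch", "painter",
            "lawyer", "lawyer", "lawyer", "engg", "engg"],
    "id_number": [125796, 789632, 256498, 456985, 456258, 125796,
                  789632, 789632, 789632, 256498, 256498],
})

# One list of row labels per unique (name, job, id_number) record.
index_lists = (
    df.groupby(["name", "job", "id_number"])["id_number"]
      .apply(lambda s: s.index.tolist())
)

# Length of each list and its smallest row label.
summary = pd.DataFrame({
    "count": index_lists.map(len),
    "min_index": index_lists.map(min),
})

# Keep only the records that occur more than once.
duplicates_only = summary[summary["count"] > 1]
print(duplicates_only)
```

With the default integer index, the first element of each list is also its minimum, so `index_lists.map(lambda lst: lst[0])` would give the same `min_index` column here.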

Depending on what you need this for, there may be a better approach; see, for example, @Quang Hoang's answer.

IIUC, duplicated combined with transform('idxmin') gives the minimum row number:

(df[df.duplicated('id_number', keep=False)]
    .groupby('id_number')['id_number'].transform('idxmin')
    .sort_values()
 )

Output:

0     0
5     0
1     1
6     1
7     1
8     1
2     2
9     2
10    2
Name: id_number, dtype: int64
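Since the question asks for a function, the same idea can be wrapped up as a rough sketch. The function name `duplicate_min_index` and its `subset` parameter are hypothetical; the column names `dup_Index` and `min_index` are taken from the question's expected output. Note that it follows the output shown above (minimum 1 for the tim rows) rather than the question's listing, which omits row 1 and shows 6, 7, 8 for those rows.

```python
import pandas as pd


def duplicate_min_index(df, subset):
    """Return each duplicated row's index and the smallest index in its group.

    `subset` is a list of column names that define what counts as a duplicate.
    """
    # keep=False marks every member of a duplicated group, not just the repeats.
    dup = df[df.duplicated(subset, keep=False)]
    out = dup[subset].copy()
    out["dup_Index"] = out.index
    # Smallest row label within each group of identical records.
    out["min_index"] = out.groupby(subset)["dup_Index"].transform("min")
    return (
        out[["dup_Index", "min_index"]]
        .sort_values(["min_index", "dup_Index"])
        .reset_index(drop=True)
    )


# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["krul", "tim", "daisy", "alex", "mandy", "krul",
             "tim", "tim", "tim", "daisy", "daisy"],
    "job": ["painter", "lawyer", "engg", "dancer", "arch", "painter",
            "lawyer", "lawyer", "lawyer", "engg", "engg"],
    "id_number": [125796, 789632, 256498, 456985, 456258, 125796,
                  789632, 789632, 789632, 256498, 256498],
})

result = duplicate_min_index(df, ["id_number"])
print(result)
```

Passing `["name", "job", "id_number"]` as `subset` should give the same result on this data, since the three columns repeat together.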
