Pandas: Drop duplicate records while keeping their old values in the DataFrame for reference

Posted 2024-06-16 09:37:35


I am rewriting some old code with pandas. My DataFrame looks like this:

index stop_id   stop_name   stop_lat     stop_lon  stop_id2
0         A12     Some St  40.889248   -73.898583      None
1         A14     Some St  40.889758   -73.908573      None
2         B09     Some St  40.788924   -74.846576      None
3         A22     Some St  40.889248   -73.898583      None

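Note that stop_lat and stop_lon are duplicated for stop_id 'A12' and 'A22'.

I want to drop the duplicate stop (stop_id='A22') while updating stop_id2 with the stop_id of the dropped record, so that the DataFrame looks like this: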

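index stop_id   stop_name   stop_lat     stop_lon  stop_id2
0         A12     Some St  40.889248   -73.898583       A22
1         A14     Some St  40.889758   -73.908573      None
2         B09     Some St  40.788924   -74.846576      None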

Previously, I did this while holding the data in a dictionary:

d={'A12':['Some St', 40.889248, -73.898583, None],'A14': ['Some St', 40.889758, -73.908573, None],'B09':['Some St', 40.788924,-74.846576, None], 'A22':['Some St', 40.889248, -73.898583, None]}

if d['A12'][1]+d['A12'][2]==d['A22'][1]+d['A22'][2]:
   del d['A22']
   d['A12'][-1]='A22'

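I would like to do the same thing in pandas. I know that if I just use:

df = df.drop_duplicates(['stop_lat', 'stop_lon'])

I will lose the duplicate record, and its id will not be kept. I need to keep the id of the dropped stop to get the metadata right.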


2 answers

Get a mask of the duplicates

cols = ['stop_lat', 'stop_lon']
dups = df.duplicated(subset=cols)

Subset df with the mask

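# indexed by cols so the assignment below can align on (stop_lat, stop_lon)
nodups = df[~dups].set_index(cols)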

The duplicates can have duplicates of their own, so keep only the first one for each location

first_dup = df[dups].drop_duplicates(subset=cols)
first_dup = first_dup.set_index(cols).stop_id

Assign accordingly

nodups.loc[first_dup.index, 'stop_id2'] = first_dup
nodups
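For reference, here is a self-contained sketch of the steps above, run against the sample data from the question. It is a slight variant rather than the exact code above: it assigns the column directly and relies on pandas index alignment instead of `.loc`, so rows without a dropped duplicate end up as NaN rather than None. The name `result` is only used for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "stop_id": ["A12", "A14", "B09", "A22"],
    "stop_name": ["Some St"] * 4,
    "stop_lat": [40.889248, 40.889758, 40.788924, 40.889248],
    "stop_lon": [-73.898583, -73.908573, -74.846576, -73.898583],
    "stop_id2": [None] * 4,
})

cols = ["stop_lat", "stop_lon"]

# True for every repeat of a coordinate pair after its first occurrence
dups = df.duplicated(subset=cols)

# Rows that survive, indexed by their coordinates
nodups = df[~dups].set_index(cols)

# First dropped stop_id for each coordinate pair, indexed the same way
first_dup = df[dups].drop_duplicates(subset=cols).set_index(cols)["stop_id"]

# Column assignment aligns on the (stop_lat, stop_lon) index;
# rows with no dropped duplicate get NaN
nodups["stop_id2"] = first_dup

# Restore the original column order
result = nodups.reset_index()[list(df.columns)]
print(result)
```

If a location has more than two stops, only the first dropped id is kept in stop_id2, which matches the behaviour of the steps above.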


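subset = ['stop_lat', 'stop_lon']

# Rows to keep: the first occurrence of each coordinate pair
new_df = df[~df.duplicated(subset=subset, keep='first')]

# Rows that are dropped: every later occurrence, reduced to the key columns and the id
duplicates_df = df[df.duplicated(subset=subset, keep='first')][['stop_lat', 'stop_lon', 'stop_id']]

# The left merge attaches the dropped stop_id (as stop_id_y) to the row that was kept
new_df.merge(duplicates_df, how='left', on=subset)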
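A runnable sketch of this merge-based idea on the question's sample data might look like the following. It makes a few assumptions that are not in the answer: the empty stop_id2 column is dropped before merging, the dropped ids are renamed to stop_id2 so the merge fills that column directly, and the names `kept`, `dropped` and `result` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "stop_id": ["A12", "A14", "B09", "A22"],
    "stop_name": ["Some St"] * 4,
    "stop_lat": [40.889248, 40.889758, 40.788924, 40.889248],
    "stop_lon": [-73.898583, -73.908573, -74.846576, -73.898583],
    "stop_id2": [None] * 4,
})

subset = ["stop_lat", "stop_lon"]
is_dup = df.duplicated(subset=subset, keep="first")

# Rows to keep, without the placeholder column that the merge will refill
kept = df[~is_dup].drop(columns="stop_id2")

# One dropped id per coordinate pair, renamed to the target column
dropped = (
    df[is_dup]
    .drop_duplicates(subset=subset)[subset + ["stop_id"]]
    .rename(columns={"stop_id": "stop_id2"})
)

# Left merge: every kept row survives; unmatched rows get NaN in stop_id2
result = kept.merge(dropped, how="left", on=subset)
print(result)
```

The merge matches on exact float equality of the coordinates, which is the same criterion `duplicated` uses, so the two steps agree on which rows belong together.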
