Delete entire rows if there are duplicate values in specific columns

Published 2024-05-15 01:46:07


I have read a CSV file (containing customers' names and addresses) and loaded the data into a DataFrame.

Description of the CSV file (or DataFrame)

The DataFrame contains several rows and 7 columns.

Sample data

Client_id Client_Name Address1        Address3       Post_Code   City_Name              Full_Address                            

 C0000001     A       10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH     
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH     
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH     
 C0000002     B       10001998  RUE EDWARD STEICHEN    L-1855  LUXEMBOURG  RUE EDWARD STEICHEN,L-1855,LUXEMBOURG     
 C0000002     B       10001998  RUE EDWARD STEICHEN    L-1855  LUXEMBOURG  RUE EDWARD STEICHEN,L-1855,LUXEMBOURG     
 C0000002     B       10001998  RUE EDWARD STEICHEN    L-1855  LUXEMBOURG  RUE EDWARD STEICHEN,L-1855,LUXEMBOURG     
 C0000003     C       11000051       9 RUE DU BRILL    L-3898       FOETZ           9 RUE DU BRILL,L-3898 ,FOETZ     
 C0000003     C       11000051       9 RUE DU BRILL    L-3898       FOETZ           9 RUE DU BRILL,L-3898 ,FOETZ     
 C0000003     C       11000051       9 RUE DU BRILL    L-3898       FOETZ           9 RUE DU BRILL,L-3898 ,FOETZ     
 C0000004     D       10000009    37 RUE DE LA GARE    L-7535      MERSCH       37 RUE DE LA GARE,L-7535, MERSCH     
 C0000005     E       10001998  RUE EDWARD STEICHEN    L-1855  LUXEMBOURG  RUE EDWARD STEICHEN,L-1855,LUXEMBOURG     

So far I have written the following code to produce the table above:

import pandas as pd

Excel_file = 'Address.xlsx'
Address_Info = pd.read_excel(Excel_file)

# rename the columns
Address_Info.columns = ['Client_ID', 'Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country'] 

# extract specific columns into a new dataframe
Bin_Address= Address_Info[['Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country']].copy()


# Strip leading/trailing whitespace from the string columns
Bin_Address = Bin_Address.apply(lambda col: col.str.strip() if col.dtype == object else col)

# Adding a new column called (Full_Address) that concatenate address columns into one 
# for example   Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Bin_Address['Full_Address'] = Bin_Address[Bin_Address.columns[1:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)



# placeholder columns to be filled in later
Bin_Address['latitude'] = None
Bin_Address['longitude'] = None

# Remove repetitive addresses
#Temp = list( dict.fromkeys(Bin_Address.Full_Address) )

# Remove repetitive values (I believe the modification should be here)
Temp = list( dict.fromkeys(Address_Info.Client_ID) )

I want to delete the entire row when there are duplicate values in the Client_ID, Client_Name, and Full_Address columns. So far the code runs without errors, but it does not produce the expected result (I believe the modification belongs in the last line of the attached code).
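Before removing anything, the duplicate rows can be inspected with pandas' duplicated() method, which flags every repeat of an earlier row. A minimal sketch, using a small hypothetical frame that mirrors the first two clients from the table above:

```python
import pandas as pd

# Hypothetical sample mirroring the question's data (first two clients only)
df = pd.DataFrame({
    'Client_ID':    ['C0000001', 'C0000001', 'C0000002', 'C0000002'],
    'Client_Name':  ['A', 'A', 'B', 'B'],
    'Full_Address': ['37 RUE DE LA GARE,L-7535, MERSCH'] * 2
                    + ['RUE EDWARD STEICHEN,L-1855,LUXEMBOURG'] * 2,
})

# duplicated() marks repeats of an earlier row; the first occurrence stays False
mask = df.duplicated(subset=['Client_ID', 'Client_Name', 'Full_Address'])
print(mask.tolist())   # [False, True, False, True]
```

Rows where the mask is True are the ones that would be dropped by a deduplication keyed on those three columns.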

The expected output is

Client_id Client_Name Address1        Address3       Post_Code   City_Name              Full_Address                            
 C0000001     A       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH            
 C0000002     B       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG         
 C0000003     C       11000051    9 RUE DU BRILL       L-3898     FOETZ            9 RUE DU BRILL,L-3898 ,FOETZ         
 C0000004     D       10000009    37 RUE DE LA GARE    L-7535     MERSCH           37 RUE DE LA GARE,L-7535, MERSCH     
 C0000005     E       10001998    RUE EDWARD STEICHEN  L-1855     LUXEMBOURG       RUE EDWARD STEICHEN,L-1855,LUXEMBOURG     

2 Answers

Try:

df = df.drop_duplicates(['Client_ID', 'Client_Name', 'Full_Address'])

You can use the built-in pandas method drop_duplicates(). It also offers several ready-made options you can apply.

<your_dataframe>.drop_duplicates(subset=["Client_ID", "Client_Name", "Full_Address"])

You can also choose which occurrence to keep when a value is duplicated: the first or the last.

  <your_dataframe>.drop_duplicates(subset=["Client_ID", "Client_Name", "Full_Address"], keep="first") # "first" or "last"

By default it always keeps the first occurrence.
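The effect of keep="first" versus keep="last" can be seen on a toy frame with the same duplicate pattern as the question (the sample values are assumptions, not the asker's real data):

```python
import pandas as pd

# Toy frame: three identical rows for one client, one unique row for another
df = pd.DataFrame({
    'Client_ID':    ['C0000001'] * 3 + ['C0000002'],
    'Client_Name':  ['A', 'A', 'A', 'B'],
    'Full_Address': ['37 RUE DE LA GARE,L-7535, MERSCH'] * 3
                    + ['RUE EDWARD STEICHEN,L-1855,LUXEMBOURG'],
})

cols = ['Client_ID', 'Client_Name', 'Full_Address']
first = df.drop_duplicates(subset=cols)               # default keep="first"
last = df.drop_duplicates(subset=cols, keep='last')   # keep the last repeat instead

print(first.index.tolist())  # [0, 3]
print(last.index.tolist())   # [2, 3]
```

Either way one row survives per duplicate group; only its original index differs, which matters if other columns (here none) varied between the repeats.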
