使用类似字符串按列合并两个CSVFile

2024-05-08 13:12:43 发布

您现在位置:Python中文网/ 问答频道 /正文

我开始使用熊猫,我遇到了一个我不知道如何解决的问题

我有两个数据帧

第一个包含汽车信息,包括汽车型号(列DESCR_MARCA_VEICULO)

df1
col1   col2   DESCR_MARCA_VEICULO
....   ....   'GM/CELTA 5 PORTAS SUPER'
....   ....   'VW/VOYAGE LS'
....   ....   'VW/GOL LS'
....   ....   'I/AUDI A4 2.0T FSI'
....   ....   'FIAT/UNO CS IE'

第二个包含两列de para,其中包含车型和与该车型关联的唯一ID,如下所示:

df2
ID     DESCR_MARCA_VEICULO
1      'GM - CELTA 5'
2      'VW - VOYAGE LS'
3      'VW - GOL LS'
4      'ACURA - INTEGRA GS 1.8'
5      'AUDI - 80 S2 AVANT'

而且它不一定遵循像用“-”替换“/”之类的模式

然而,我在DF1中有5000多种不同的车型(这使我无法逐案查看),我需要结合DF1和DF2,将ID列带到DF1(这将是一个合并)。但是,当我合并dfs时,由于字符串中的这些差异,没有匹配项

有没有办法通过DESCR_MARCA_VEICULO列中字符串之间的相似性来合并这些dfs

谢谢:)


Tags: 字符串idlsdf1vwdescr车型audi
1条回答
网友
1楼 · 发布于 2024-05-08 13:12:43

我建议您研究fuzzyfuzzy包的模糊匹配

一种方法是从字符串中删除除字母&;以外的所有字符;数字

df1['var'] = df1['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]','')
df2['var'] = df2['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]','')

m = pd.merge(df1,df2,on='var').drop('var',axis=1)

print(m)

    col1     col2 DESCR_MARCA_VEICULO_x     matches  ID DESCR_MARCA_VEICULO_y
0  ....     ....           VW/VOYAGE LS  VWVOYAGELS   2        VW - VOYAGE LS
1  ....     ....              VW/GOL LS     VWGOLLS   3           VW - GOL LS

然而,我们需要匹配几个缺少的键:让我们使用Erfan在这个post中的答案

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

print(fuzzy_merge(df1, df2, 'var', 'var', threshold=80))

    col1     col2      DESCR_MARCA_VEICULO                  var     matches
0  ....     ....   GM/CELTA 5 PORTAS SUPER  GMCELTA5PORTASSUPER    GMCELTA5
1  ....     ....              VW/VOYAGE LS           VWVOYAGELS  VWVOYAGELS
2  ....     ....                 VW/GOL LS              VWGOLLS     VWGOLLS
3  ....     ....        I/AUDI A4 2.0T FSI        IAUDIA420TFSI            
4  ....     ....            FIAT/UNO CS IE          FIATUNOCSIE   

相关问题 更多 >

    热门问题