使用类似字符串按列合并两个CSVFile

1条回答

网友

1楼 · 发布于 2024-05-19 23:03:08

我建议您研究fuzzyfuzzy包的模糊匹配

一种方法是从字符串中删除除字母&；以外的所有字符；数字

df1['var'] = df1['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]','')
df2['var'] = df2['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]','')

m = pd.merge(df1,df2,on='var').drop('var',axis=1)

print(m)

    col1     col2 DESCR_MARCA_VEICULO_x     matches  ID DESCR_MARCA_VEICULO_y
0  ....     ....           VW/VOYAGE LS  VWVOYAGELS   2        VW - VOYAGE LS
1  ....     ....              VW/GOL LS     VWGOLLS   3           VW - GOL LS

然而，我们需要匹配几个缺少的键：让我们使用Erfan在这个post中的答案

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

print(fuzzy_merge(df1, df2, 'var', 'var', threshold=80))

    col1     col2      DESCR_MARCA_VEICULO                  var     matches
0  ....     ....   GM/CELTA 5 PORTAS SUPER  GMCELTA5PORTASSUPER    GMCELTA5
1  ....     ....              VW/VOYAGE LS           VWVOYAGELS  VWVOYAGELS
2  ....     ....                 VW/GOL LS              VWGOLLS     VWGOLLS
3  ....     ....        I/AUDI A4 2.0T FSI        IAUDIA420TFSI            
4  ....     ....            FIAT/UNO CS IE          FIATUNOCSIE

相关问题更多 >

编程相关推荐

热门问题

热门文章

使用类似字符串按列合并两个CSVFile

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >