Computing the highest score in fuzzy string matching

identity_no  Pincode  company_name
IN2231       110030   AXN pvt Ltd
UK654IN      897653   Aviva Intl Ltd
SL1432       07658    Ship Incorporations
LK0678G      120988   Oppo Mobiles Pvt Ltd

import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')

for index, row in df1.iterrows():
    df1['match_acc']= fuzz.partial_ratio(df1['id_number'], df2['identity_no'])
print(df1['match_acc'])

2 answers

User

#1 · edited 2024-04-20 05:31:00

You can cross-join df1.id_number with df2.identity_no, compute fuzz.ratio() (not the partial ratio) for each pair, then map() the maximum score back onto df1:

cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())

#   id_number          company_name  match_acc
# 0   IN2231D           AXN pvt Ltd         92
# 1   UK654IN        Aviva Intl Ltd        100
# 2   SL1432H   Ship Incorporations         92
# 3   LK0678G  Oppo Mobiles pvt ltd        100
# 4   NG5678J             Nokia Inc         43

Explanation

merge() with how='cross' produces the Cartesian product of id_number and identity_no:

cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')

#    id_number identity_no
# 0    IN2231D      IN2231
# 1    IN2231D     UK654IN
# 2    IN2231D      SL1432
# ...
# 17   NG5678J     UK654IN
# 18   NG5678J      SL1432
# 19   NG5678J     LK0678G
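
For readers without pandas at hand, the same Cartesian product can be sketched with the standard library. The ID lists are copied from the sample data in this thread; the variable names are illustrative assumptions.

```python
from itertools import product

# Sample IDs taken from the tables in this thread
id_numbers = ["IN2231D", "UK654IN", "SL1432H", "LK0678G", "NG5678J"]
identity_nos = ["IN2231", "UK654IN", "SL1432", "LK0678G"]

# Every id_number paired with every identity_no, in the same row order as the cross join
pairs = list(product(id_numbers, identity_nos))

print(len(pairs))            # 5 * 4 = 20 pairs
print(pairs[0], pairs[-1])   # first and last pair
```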

Then apply() computes fuzz.ratio() for each pair:

cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)

#    id_number identity_no  match_acc
# 0    IN2231D      IN2231         92
# 1    IN2231D     UK654IN         29
# 2    IN2231D      SL1432         15
# ...
# 17   NG5678J     UK654IN         14
# 18   NG5678J      SL1432          0
# 19   NG5678J     LK0678G         43
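
To see where these numbers come from, here is a rough standard-library sketch of the two scorers. It assumes the classic difflib.SequenceMatcher definition (matched characters over total characters, scaled to 0-100). ratio_score and partial_ratio_score are hypothetical helper names, and the sliding window in partial_ratio_score is a simplification, so results may differ from the library on other inputs.

```python
from difflib import SequenceMatcher

def ratio_score(a: str, b: str) -> int:
    """Whole-string similarity: 2 * matched chars / total chars, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio_score(a: str, b: str) -> int:
    """Best ratio_score of the shorter string against equal-length windows of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    n = len(shorter)
    return max(ratio_score(shorter, longer[i:i + n]) for i in range(len(longer) - n + 1))

print(ratio_score("IN2231D", "IN2231"))          # 92
print(partial_ratio_score("IN2231D", "IN2231"))  # 100
```

The trailing D lowers the whole-string score, while the partial score ignores it. With the partial ratio, IN2231D would therefore look like an exact match for IN2231.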

Then groupby().max() takes the highest score per id_number, and map() writes it into df1.match_acc:

df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())

#   id_number          company_name  match_acc
# 0   IN2231D           AXN pvt Ltd         92
# 1   UK654IN        Aviva Intl Ltd        100
# 2   SL1432H   Ship Incorporations         92
# 3   LK0678G  Oppo Mobiles pvt ltd        100
# 4   NG5678J             Nokia Inc         43
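
The grouping step can also be sketched without pandas: keep the highest score seen for each id_number. This assumes a difflib-based stand-in for fuzz.ratio (the hypothetical score helper below), so it illustrates the logic rather than the library's exact scoring.

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> int:
    # Stand-in for fuzz.ratio, assumed here for illustration only
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Sample IDs taken from the tables in this thread
id_numbers = ["IN2231D", "UK654IN", "SL1432H", "LK0678G", "NG5678J"]
identity_nos = ["IN2231", "UK654IN", "SL1432", "LK0678G"]

# For each id_number, keep only the best score over all identity_no values
best = {i: max(score(i, j) for j in identity_nos) for i in id_numbers}
print(best)
# {'IN2231D': 92, 'UK654IN': 100, 'SL1432H': 92, 'LK0678G': 100, 'NG5678J': 43}
```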

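User

#2 · edited 2024-04-20 05:31:00

You can use fuzzywuzzy's process functions for one-to-many matching. Also consider using rapidfuzz instead of fuzzywuzzy: it offers the same functionality, but it applies some preprocessing based on string algorithms to return results faster.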

pip install rapidfuzz

import pandas as pd
# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # largely compatible API, typically much faster

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')

# build the list of choices once instead of on every iteration
choices = df2['identity_no'].tolist()

# extractOne picks the best match from the list of choices and returns a tuple
# whose second element is the score; the scorer to use can be passed in as well
df1['match_acc'] = [
    process.extractOne(query, choices, scorer=fuzz.partial_ratio)[1]
    for query in df1['id_number']
]
print(df1['match_acc'])
