Calculating the highest score in fuzzy string matching

Published 2024-04-20 05:31:00


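Finding the highest match percentage between the values of two columns using fuzzy string matching

I have two dataframes, and I am trying to fuzzy match the values of a specific column in one against a specific column in the other.

Suppose df1 has 5 rows and df2 has 4 rows. I want to take the value from each row of df1, match it against every row of df2, and find the highest match accuracy. For example, row 1 of df1 is compared with all rows of df2, and whichever df2 row gives the highest accuracy is taken as the output. The same should be done for every row of df1.

Input data: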

Dataframe1

id_number  company_name        match_acc

IN2231D    AXN pvt Ltd
UK654IN    Aviva Intl Ltd
SL1432H    Ship Incorporations
LK0678G    Oppo Mobiles pvt ltd
NG5678J    Nokia Inc

Dataframe2

identity_no   Pincode   company_name

 IN2231        110030    AXN pvt Ltd
 UK654IN       897653    Aviva Intl Ltd
 SL1432        07658     Ship Incorporations
 LK0678G       120988    Oppo Mobiles Pvt Ltd

I want to find the highest accuracy percentage and store that value in the match_acc column.

The code I am currently using:

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')


from fuzzywuzzy import fuzz 
for index, row in df1.iterrows():
   df1['match_acc']= fuzz.partial_ratio(df1['id_number'], df2['identity_no'])

print(df1['match_acc'])

I have been using fuzzywuzzy. If there is any other approach, please suggest it as well. Any suggestions are welcome.


2 answers

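You can cross join df1.id_number with df2.identity_no, compute fuzz.ratio() (the full ratio, not the partial ratio) for each pair, and then map() the maximum score per id_number back onto df1: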

cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())

#   id_number          company_name  match_acc
# 0   IN2231D           AXN pvt Ltd         92
# 1   UK654IN        Aviva Intl Ltd        100
# 2   SL1432H   Ship Incorporations         92
# 3   LK0678G  Oppo Mobiles pvt ltd        100
# 4   NG5678J             Nokia Inc         43
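
The choice between the full ratio and the partial ratio changes the scores. The following is a minimal sketch of the difference that uses the standard library's difflib as a stand-in for fuzzywuzzy; the helper names full_ratio and partial_ratio_sketch are made up for illustration, and the sliding-window partial ratio is a simplification, so real fuzzywuzzy or rapidfuzz scores may differ in some cases.

```python
from difflib import SequenceMatcher


def full_ratio(a: str, b: str) -> int:
    """Similarity of the two whole strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best full_ratio of the shorter string against any same-length
    window of the longer string (a simplified partial ratio)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        full_ratio(shorter, longer[i:i + width])
        for i in range(len(longer) - width + 1)
    )


# 'IN2231' is a prefix of 'IN2231D', so the partial score is perfect
# while the full score is penalised for the extra trailing 'D'.
print(full_ratio("IN2231D", "IN2231"))            # 92
print(partial_ratio_sketch("IN2231D", "IN2231"))  # 100
```

With a partial ratio, IN2231D and IN2231 look identical, which is why the full ratio is the one that reproduces the 92 shown in the output above.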

Explanation

merge() with how='cross' produces the Cartesian product of id_number and identity_no:

cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')

#    id_number identity_no
# 0    IN2231D      IN2231
# 1    IN2231D     UK654IN
# 2    IN2231D      SL1432
# ...
# 17   NG5678J     UK654IN
# 18   NG5678J      SL1432
# 19   NG5678J     LK0678G

Then apply() runs the fuzz scorer on each pair:

cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)

#    id_number identity_no  match_acc
# 0    IN2231D      IN2231         92
# 1    IN2231D     UK654IN         29
# 2    IN2231D      SL1432         15
# ...
# 17   NG5678J     UK654IN         14
# 18   NG5678J      SL1432          0
# 19   NG5678J     LK0678G         43

Then groupby().max() takes the maximum score per id_number, and map() writes it into df1.match_acc:

df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())

#   id_number          company_name  match_acc
# 0   IN2231D           AXN pvt Ltd         92
# 1   UK654IN        Aviva Intl Ltd        100
# 2   SL1432H   Ship Incorporations         92
# 3   LK0678G  Oppo Mobiles pvt ltd        100
# 4   NG5678J             Nokia Inc         43
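
The same three steps (all pairs, score each pair, keep the maximum per id_number) can also be sketched without pandas, which may help if how='cross' is unavailable (it requires pandas 1.2 or later). This sketch assumes difflib's SequenceMatcher as a stand-in scorer, so scores are not guaranteed to match fuzzywuzzy or rapidfuzz exactly; the score helper and the best dictionary are illustrative names. It also records which identity_no produced the best score.

```python
from difflib import SequenceMatcher
from itertools import product

# Values copied from the question's two dataframes
id_numbers = ["IN2231D", "UK654IN", "SL1432H", "LK0678G", "NG5678J"]
identity_nos = ["IN2231", "UK654IN", "SL1432", "LK0678G"]


def score(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# id_number -> (best matching identity_no, best score)
best = {}
for left, right in product(id_numbers, identity_nos):  # Cartesian product
    s = score(left, right)
    if left not in best or s > best[left][1]:
        best[left] = (right, s)

for id_number, (match, s) in best.items():
    print(id_number, match, s)
```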

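You can use the process functions from fuzzywuzzy for one-to-many matching. It is also worth using rapidfuzz instead of fuzzywuzzy: it offers the same functionality, but applies some preprocessing based on string algorithms to return results faster.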

pip install rapidfuzz

import pandas as pd

# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # same interface, usually much faster

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')

choices = df2['identity_no'].tolist()

for index, row in df1.iterrows():
    # extractOne picks the best match from the list of choices;
    # the scorer argument selects which fuzz scorer to use
    best = process.extractOne(query=row['id_number'], choices=choices, scorer=fuzz.partial_ratio)
    # the result is a tuple whose second element is the score;
    # write it to this row only, not to the whole column
    df1.at[index, 'match_acc'] = best[1]

print(df1['match_acc'])
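
When the choices are a list, process.extractOne in rapidfuzz returns a (match, score, index) tuple, while fuzzywuzzy returns (match, score), so only the second element belongs in match_acc. The sketch below imitates that return shape with the standard library; extract_one_sketch is a hypothetical helper, not part of either library, and it scores with difflib, so its numbers may differ from the real scorers.

```python
from difflib import SequenceMatcher


def extract_one_sketch(query, choices):
    """Return (choice, score, index) for the best-scoring choice,
    or None when choices is empty. Illustrative only."""
    best = None
    for idx, choice in enumerate(choices):
        s = round(100 * SequenceMatcher(None, query, choice).ratio())
        if best is None or s > best[1]:
            best = (choice, s, idx)
    return best


identity_nos = ["IN2231", "UK654IN", "SL1432", "LK0678G"]

result = extract_one_sketch("SL1432H", identity_nos)
match, match_acc, position = result  # unpack; match_acc is the score
print(match, match_acc, position)
```

Unpacking the tuple keeps the matched identity_no and its position available in case the matching row of df2 is needed later.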
