Python Pandas - Fuzzy duplicates matching

Posted 2024-04-25 21:33:31


I have a DataFrame like this:

    make                model
0   allard              K1
1   alllard             J2
2   alpine renault      A110
3   alpine renualt      A310
4   amc (rambler        American
5   amc (rambler)       Marlin
6   aries               1907
7   ariès               1932
8   austin healey       3000
9   austin-healey       Sprite
10  benjamin et benova  Type B3
11  benjamin/benova     Type P2
12  benjmin/benova      Type P3

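The goal is to add a third column that holds, for each row, the index of the row with the highest fuzzy ratio (the closest fuzzy match).

How can I compare the rows efficiently?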


1 answer
User
#1 · Posted 2024-04-25 21:33:31

Using fuzzywuzzy, and assuming the fuzzy matching should be done on the make column, you can try:

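import pandas as pd
from itertools import product
from fuzzywuzzy.fuzz import ratio

df = pd.read_csv('data.csv')

# Score every ordered pair of distinct makes (quadratic in the number of unique makes).
keys = list(set(df['make']))
ratios = pd.DataFrame([{'k1': k1, 'k2': k2, 'ratio': ratio(k1, k2)} for k1, k2 in product(keys, keys) if k1 != k2])

def find_closest(make):
    # idxmax() returns the index label of the best-scoring pair, which is what .loc expects;
    # argmax() returns a position in current pandas and may select the wrong row.
    best = ratios.loc[ratios.loc[ratios['k1'] == make, 'ratio'].idxmax(), 'k2']
    # Index of the first row whose make equals the best match.
    return df.index[df['make'] == best][0]

df['closest_index'] = df['make'].apply(find_closest)

print(df)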

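If you want to try the idea without installing anything or creating data.csv, here is a minimal standard-library sketch. It assumes difflib.SequenceMatcher is an acceptable stand-in for fuzzywuzzy's ratio; it scores from 0 to 1 rather than 0 to 100, and the values may not match fuzzywuzzy exactly. The helper names similarity and closest_indices are made up for this example, and the makes list is copied from the question. Unlike the code above, it skips only the row itself rather than every row with an identical make, which makes no difference here because all makes are distinct.

```python
from difflib import SequenceMatcher

# The "make" column from the question, in row order.
makes = [
    "allard", "alllard", "alpine renault", "alpine renualt",
    "amc (rambler", "amc (rambler)", "aries", "ariès",
    "austin healey", "austin-healey",
    "benjamin et benova", "benjamin/benova", "benjmin/benova",
]

def similarity(a, b):
    # Similarity in [0, 1]; a stand-in for fuzzywuzzy's ratio (assumption).
    return SequenceMatcher(None, a, b).ratio()

def closest_indices(values):
    # For each position i, find the other position j whose string is most similar.
    result = []
    for i, a in enumerate(values):
        candidates = (j for j in range(len(values)) if j != i)
        result.append(max(candidates, key=lambda j: similarity(a, values[j])))
    return result

closest = closest_indices(makes)
for i, j in enumerate(closest):
    print(i, makes[i], "->", j, makes[j])
```

This still compares every pair, so it is quadratic in the number of rows. With pandas, the resulting list could be assigned directly as the new column, for example df['closest_index'] = closest_indices(df['make'].tolist()), provided the DataFrame uses the default 0..n-1 index.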
