Below is a sample of my df:
name
A S BITO
A S KIGEL
A S NATURENERGI
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER PETER GMBH
A S P GMBH
A RESE LAND
A RITTER WITH SA
A RITTER WITH SA
A RITTER WITH SA
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER WITH MASCHINE
A RITTER WITH MASCHINE SA CO
A RITTER WITH MASCHINE SA CO
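The goal is to replace each name with the most frequent unique value among its similar variants.
Below is the list of unique values:
name occurrences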
A S BITO 1
A S KIGEL 1
A S NATURENERGI 1
A S NATURENERGIE 2
A S P BU SERVICE POWER P 2
A S P BU SERVICE POWER PETER GMBH 1
A S P GMBH 1
A RESE LAND 1
A RITTER WITH SA 3
A RITTER SA CO 4
A RITTER WITH MASCHINE 1
A RITTER WITH MASCHINE SA CO 2
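A count table like the one above can be derived directly from the column. The following is a minimal sketch, assuming the column is called `name` as in the sample; the variable name `counts` is only illustrative.

```python
import pandas as pd

# Sample data copied from the df shown in the question.
df = pd.DataFrame({"name": [
    "A S BITO", "A S KIGEL", "A S NATURENERGI",
    "A S NATURENERGIE", "A S NATURENERGIE",
    "A S P BU SERVICE POWER P", "A S P BU SERVICE POWER P",
    "A S P BU SERVICE POWER PETER GMBH", "A S P GMBH", "A RESE LAND",
    "A RITTER WITH SA", "A RITTER WITH SA", "A RITTER WITH SA",
    "A RITTER SA CO", "A RITTER SA CO", "A RITTER SA CO", "A RITTER SA CO",
    "A RITTER WITH MASCHINE",
    "A RITTER WITH MASCHINE SA CO", "A RITTER WITH MASCHINE SA CO",
]})

# One row per unique name together with how often it occurs.
counts = (
    df["name"]
    .value_counts()
    .rename_axis("name")
    .reset_index(name="occurrences")
)
print(counts)
```

Note that `value_counts` sorts by frequency, so the row order differs from the list above.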
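As you can see in the df, some names could be grouped together.
However, because of spelling errors, they are not.
The desired output looks like this: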
name
A S BITO
A S KIGEL
A S NATURENERGIE
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P GMBH
A RESE LAND
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
Here is the code:
df['name'] = df['name'].replace('A S NATURENERGI', 'A S NATURENERGIE')
df['name'] = df['name'].replace('A S P BU SERVICE POWER PETER GMBH', 'A S P BU SERVICE POWER P')
df['name'] = df['name'].replace('A RITTER WITH SA', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE SA CO', 'A RITTER SA CO')
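However, this is probably not the best way to solve the problem, since every replacement has to be written by hand.
So I considered using difflib to compute a match score.
The next step would be to replace each name with its highest-scoring match.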
import difflib
from functools import partial

# Best match for each name among all names in the column (n=1 keeps only the top hit)
f = partial(difflib.get_close_matches, possibilities=df['name'].tolist(), n=1)
matches = df['name'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(matches, df['name'])]
df_diff = df.assign(best=matches, score=scores)
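The drawback of this approach is that each name is matched with itself, because the name is also in the list of candidates, so I just get the exact same name back.
So if anyone has any ideas, I would be very grateful.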
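One way around the self-match is to remove the name being looked up from the candidate list before calling `difflib.get_close_matches`. The sketch below assumes this is acceptable for the data at hand; the helper name `closest_other` and the `cutoff` value are illustrative choices, not part of the question.

```python
import difflib

# Unique names taken from the sample df in the question.
unique_names = [
    "A S BITO", "A S KIGEL", "A S NATURENERGI", "A S NATURENERGIE",
    "A S P BU SERVICE POWER P", "A S P BU SERVICE POWER PETER GMBH",
    "A S P GMBH", "A RESE LAND", "A RITTER WITH SA", "A RITTER SA CO",
    "A RITTER WITH MASCHINE", "A RITTER WITH MASCHINE SA CO",
]


def closest_other(name, candidates=unique_names, cutoff=0.6):
    """Return the closest candidate that is not `name` itself, or '' if none passes the cutoff."""
    others = [c for c in candidates if c != name]
    hits = difflib.get_close_matches(name, others, n=1, cutoff=cutoff)
    return hits[0] if hits else ""


for name in unique_names:
    print(f"{name!r} -> {closest_other(name)!r}")
```

This only finds the nearest different spelling. It does not yet decide which of the two spellings should win; that still requires the occurrence counts.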
I created a custom function that iterates over a Series and builds a mapping:
Here is an example:
Output:
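The sketch below is not the original function from this answer. It is only one possible way to build such a mapping with `difflib`: names are visited from most to least frequent, and each name is mapped to an already accepted, more frequent name if it is similar enough. The function name `build_name_mapping` and the default `cutoff=0.8` are assumptions and would need tuning against real data.

```python
import difflib
from collections import Counter

import pandas as pd


def build_name_mapping(names, cutoff=0.8):
    """Map each distinct name to the most frequent name it closely matches."""
    counts = Counter(list(names))
    # Most frequent first; ties are broken alphabetically to keep the result deterministic.
    ordered = sorted(counts, key=lambda n: (-counts[n], n))
    canonical = []  # names kept as group representatives
    mapping = {}
    for name in ordered:
        hits = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if hits:
            mapping[name] = hits[0]   # fold into an existing, more frequent name
        else:
            canonical.append(name)    # start a new group
            mapping[name] = name
    return mapping


# Small excerpt of the sample data from the question.
df = pd.DataFrame({"name": [
    "A S NATURENERGI", "A S NATURENERGIE", "A S NATURENERGIE", "A S BITO",
]})
mapping = build_name_mapping(df["name"])
df["name"] = df["name"].map(mapping)
print(df["name"].tolist())
```

Because the grouping depends entirely on character similarity, a pure `difflib` score will not necessarily reproduce every grouping in the desired output, so the resulting mapping should be inspected before it is applied.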