Group unique values by the most frequently occurring unique value

Posted 2024-05-29 06:23:36

Here is a sample of my df:

name
A S BITO 
A S KIGEL 
A S NATURENERGI
A S NATURENERGIE 
A S NATURENERGIE 
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER PETER GMBH 
A S P GMBH  
A RESE LAND
A RITTER WITH SA
A RITTER WITH SA    
A RITTER WITH SA
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO
A RITTER SA CO  
A RITTER WITH MASCHINE
A RITTER WITH MASCHINE SA CO 
A RITTER WITH MASCHINE SA CO 

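The goal is to replace each name with the most frequently occurring unique value among its near-duplicates.

Here is the list of unique values with their counts: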

name                                 occurrences
A S BITO                             1
A S KIGEL                            1
A S NATURENERGI                      1
A S NATURENERGIE                     2
A S P BU SERVICE POWER P             2 
A S P BU SERVICE POWER PETER GMBH    1
A S P GMBH                           1
A RESE LAND                          1
A RITTER WITH SA                     3
A RITTER SA CO                       4
A RITTER WITH MASCHINE               1
A RITTER WITH MASCHINE SA CO         2
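
A count table like the one above can be produced with `value_counts`. This is a minimal sketch, assuming trailing whitespace should be ignored when counting; the short `names` Series is illustrative and is not the full data.

```python
import pandas as pd

# Illustrative subset of the names, including stray trailing spaces
names = pd.Series([
    "A S NATURENERGI",
    "A S NATURENERGIE ",
    "A S NATURENERGIE",
    "A S BITO ",
])

# Strip whitespace first so "X" and "X " count as the same value
counts = names.str.strip().value_counts()
print(counts)
```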

As you can see in the df, some names could be grouped together.
However, because of spelling errors, they are not.

The desired output looks like this:

name
A S BITO 
A S KIGEL 
A S NATURENERGIE
A S NATURENERGIE 
A S NATURENERGIE 
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P 
A S P GMBH  
A RESE LAND
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO

Here is my current code:

df['name'] = df['name'].replace('A S NATURENERGI', 'A S NATURENERGIE')
df['name'] = df['name'].replace('A S P BU SERVICE POWER PETER GMBH', 'A S P BU SERVICE POWER P')
df['name'] = df['name'].replace('A RITTER WITH SA', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE SA CO ', 'A RITTER SA CO')
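
The chained calls above can be collapsed into a single call by passing a dict to `Series.replace`. This is a minimal sketch, assuming names should be stripped of surrounding whitespace before matching; the small frame and the `corrections` dict are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A S NATURENERGI", "A RITTER WITH SA ", "A S BITO"]})

# Hand-written mapping from misspelled name to canonical name (illustrative)
corrections = {
    "A S NATURENERGI": "A S NATURENERGIE",
    "A RITTER WITH SA": "A RITTER SA CO",
}

# replace() matches whole values exactly, so strip whitespace first
df["name"] = df["name"].str.strip().replace(corrections)
print(df["name"].tolist())
```

This still requires maintaining the mapping by hand, which is the limitation the question is trying to get around.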

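However, this is probably not the best way to solve the problem.
So I am considering using difflib to compute a match score.
The next step would be to replace each name with its highest-scoring match.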

import difflib
from functools import partial

# For each name, take the single closest match among all names in the column
f = partial(difflib.get_close_matches, possibilities=df['name'].tolist(), n=1)
matches = df['name'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(matches, df['name'])]
df_diff = df.assign(best=matches, score=scores)

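The drawback of this approach is that the best match for each name is the identical name itself, since every name is also in the list of possibilities.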
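
One way around the self-match is to remove the name being looked up from the candidate list before calling `difflib.get_close_matches`. This is a minimal sketch; the helper name `closest_other`, the sample list, and the cutoff of 0.6 (the difflib default) are assumptions for illustration.

```python
import difflib

# Illustrative unique names
unique_names = ["A S NATURENERGI", "A S NATURENERGIE", "A S BITO"]


def closest_other(name, candidates, cutoff=0.6):
    """Return the closest candidate that is not `name` itself, or None."""
    others = [c for c in candidates if c != name]
    found = difflib.get_close_matches(name, others, n=1, cutoff=cutoff)
    return found[0] if found else None


print(closest_other("A S NATURENERGI", unique_names))
print(closest_other("A S BITO", unique_names))
```

Names with no sufficiently similar neighbour return None here, so they can be left unchanged.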

If anyone has ideas, I would greatly appreciate them.


1 answer
User
#1 · Posted 2024-05-29 06:23:36

I wrote a custom function that iterates over a Series, builds lookup dictionaries, and maps each name to a close match:

import difflib

import pandas as pd


def similarity_replace(series):
    # Map each name to its space-free form, and back again
    reverse_map = {}
    diz_map = {}
    for _, s in series.items():  # iteritems() was removed in pandas 2.0
        diz_map[s] = s.replace(" ", "")
        reverse_map[s.replace(" ", "")] = s

    # For each space-free name, keep the shortest of its (up to 3) closest matches
    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6), key=len)[0]

    return series.map(diz_map).map(best_match).map(reverse_map)

Here is an example:

name = pd.Series(['A S BITO', 
'A S KIGEL',
'A S NATURENERGI',
'A S NATURENERGIE',
'A S NATURENERGIE',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER PETER GMBH',
'A S P GMBH',
'A RESE LAND',
'A RITTER WITH SA',
'A RITTER WITH SA', 
'A RITTER WITH SA',
'A RITTER SA CO',
'A RITTER SA CO', 
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER WITH MASCHINE',
'A RITTER WITH MASCHINE SA CO', 
'A RITTER WITH MASCHINE SA CO'])

similarity_replace(similarity_replace(name))

Output:

0                     A S BITO
1                    A S KIGEL
2              A S NATURENERGI
3              A S NATURENERGI
4              A S NATURENERGI
5     A S P BU SERVICE POWER P
6     A S P BU SERVICE POWER P
7     A S P BU SERVICE POWER P
8                   A S P GMBH
9                  A RESE LAND
10              A RITTER SA CO
11              A RITTER SA CO
12              A RITTER SA CO
13              A RITTER SA CO
14              A RITTER SA CO
15              A RITTER SA CO
16              A RITTER SA CO
17              A RITTER SA CO
18              A RITTER SA CO
19              A RITTER SA CO
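
Note that the function above keeps the shortest close match (`key=len`), which is why rows 2 to 4 show A S NATURENERGI rather than the more frequent A S NATURENERGIE asked for in the question. The following is a hedged variant, not the answerer's method: it picks, among each name's close matches, the one with the highest count. The function name `replace_with_most_frequent` and the cutoff of 0.6 are assumptions, and the cutoff will likely need tuning on real data.

```python
import difflib

import pandas as pd


def replace_with_most_frequent(series, cutoff=0.6):
    """Map each name to the most frequent name among its close matches."""
    cleaned = series.str.strip()
    counts = cleaned.value_counts()
    unique_names = list(counts.index)

    mapping = {}
    for name in unique_names:
        # The name itself is always returned (ratio 1.0), so `close` is never empty
        close = difflib.get_close_matches(
            name, unique_names, n=len(unique_names), cutoff=cutoff
        )
        # On a tie in counts, max() keeps the earliest entry, i.e. the best match
        mapping[name] = max(close, key=lambda c: counts[c])

    return cleaned.map(mapping)


names = pd.Series([
    "A S NATURENERGI",
    "A S NATURENERGIE",
    "A S NATURENERGIE",
    "A S BITO",
])
result = replace_with_most_frequent(names)
print(result.tolist())
```

A single pass only looks one step away from each name, so chains of variants may need a second pass, as in the nested `similarity_replace` call above.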
