Using FuzzyWuzzy logic on a single dataframe to replace similar values with the most frequently occurring instance

Posted 2024-06-02 06:25:47


I am facing a problem when applying fuzzy logic for data cleaning in Python. My data looks like this:

data=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "Count":['140','120','50','45','30','20','10','5']})
data

I use fuzzy logic to compare the values in the dataframe. The final output should have a third column with results like this:

data_out=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "New_Column":["Deloitte",'Accenture','Accenture','Accenture','Ernst & young','Ernst & young','Tata Consultancy Services','Deloitte']})
data_out

So, as you can see, I want the less frequent values to get an entry in a new column that holds the most frequently occurring value of their kind. This is where fuzzy logic is useful.
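
A minimal, standalone sketch (not part of the original question), assuming FuzzyWuzzy is installed, showing the kind of pairwise similarity scores it produces for these employer names:

from fuzzywuzzy import fuzz

pairs = [('Deloitte', 'Deloitte Uk'),
         ('Accenture', 'Accenture Solutions Ltd'),
         ('Ernst & young', 'EY')]

for a, b in pairs:
    # token_set_ratio ignores word order and extra tokens, so names
    # that only differ by a suffix such as 'Uk' or 'Ltd' score high
    print(a, '<->', b, fuzz.token_set_ratio(a, b))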


Tags: data, dataframe, logic, pd, usa, young, ltd
2 Answers

Most of the duplicated employers can easily be detected using fuzzy string matching. However, Ernst & young <-> EY are not actually similar strings, which is why I ignore that replacement here. This solution uses my library RapidFuzz, but you could achieve something similar with FuzzyWuzzy (it just requires a bit more code, since it does not have the extractIndices processor; a rough FuzzyWuzzy-based sketch is shown after the output below).

import pandas as pd
from rapidfuzz import process, utils

def add_deduped_employer_column(data):
    values = data.values.tolist()
    employers = [employer for employer, _ in values]

    # preprocess strings beforehand (lowercase + remove punctuation),
    # so this is not done multiple times
    processed_employers = [utils.default_process(employer)
        for employer in employers]
    deduped_employers = employers.copy()

    replaced = []
    for (i, (employer, processed_employer)) in enumerate(
            zip(employers, processed_employers)):
        # skip elements that already got replaced
        if i in replaced:
            continue

        duplicates = process.extractIndices(
            processed_employer, processed_employers[i+1:],
            processor=None, score_cutoff=90, limit=None)

        for (c, _) in duplicates:
            deduped_employers[i+c+1] = employer
            """
            by replacing the element with an empty string the index from
            extractIndices stays correct but it can be skipped a lot 
            faster, since the compared strings will have very different
            lengths
            """
            processed_employers[i+c+1] = ""
            replaced.append(i+c+1)

    data['New_Column'] = deduped_employers

data=pd.DataFrame({
    'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'],
    "Count":['140','120','50','45','30','20','10','5']})

add_deduped_employer_column(data)
print(data)

Which results in the following dataframe:

                    Employer Count                 New_Column
0                   Deloitte   140                   Deloitte
1                  Accenture   120                  Accenture
2    Accenture Solutions Ltd    50                  Accenture
3              Accenture USA    45                  Accenture
4              Ernst & young    30              Ernst & young
5                         EY    20                         EY
6  Tata Consultancy Services    10  Tata Consultancy Services
7                Deloitte Uk     5                   Deloitte
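
For completeness, here is a rough FuzzyWuzzy-based sketch of the same idea (an adaptation, not the answerer's code): since FuzzyWuzzy's process module does not return match indices, the inner comparison is written as an explicit loop, and fuzz.WRatio is used as the scorer (it lowercases and strips punctuation internally). The score_cutoff of 90 is an assumption and may need tuning to reproduce exactly the same grouping.

import pandas as pd
from fuzzywuzzy import fuzz

def add_deduped_employer_column_fw(data, score_cutoff=90):
    employers = data['Employer'].tolist()
    deduped = employers.copy()
    replaced = set()

    for i in range(len(employers)):
        if i in replaced:
            continue
        for j in range(i + 1, len(employers)):
            if j in replaced:
                continue
            # WRatio preprocesses both strings (lowercase, strip punctuation)
            if fuzz.WRatio(employers[i], employers[j]) >= score_cutoff:
                # the data is sorted by Count, so the first occurrence
                # is kept as the canonical name
                deduped[j] = employers[i]
                replaced.add(j)

    data['New_Column'] = deduped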

I have not used fuzzy matching, but the following may help.

Data

import pandas as pd
import numpy as np

df=pd.DataFrame({'Employer':['Accenture','Accenture Solutions Ltd','Accenture USA', 'hjk USA', 'Tata Consultancy Services']})
df

You did not explain why Tata keeps its full name, so I assume it is special and mask it:

m=df.Employer.str.contains('Tata')

Then I use np.where to replace anything after the first name:

# regex=True is needed in newer pandas for the pattern to be treated as a regex
df['New_Column']=np.where(m, df['Employer'], df['Employer'].str.replace(r'(\s+\D+)', '', regex=True))
df

Output

(image: the resulting dataframe with the Employer and New_Column columns)
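
For reference, a quick standalone check (not from the answer) of what that regular expression removes from one of the sample strings:

import re

# r'\s+\D+' greedily matches the first whitespace and all non-digit
# characters after it, i.e. everything after the first word
print(re.sub(r'\s+\D+', '', 'Accenture Solutions Ltd'))  # Accenture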
