Using FuzzyWuzzy logic on a single dataframe to replace similar values with the most frequently occurring instance

Posted 2024-06-02 06:25:47


I am facing a problem when applying fuzzy logic for data cleaning in Python. My data looks like this:

data=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "Count":['140','120','50','45','30','20','10','5']})
data

I use fuzzy logic to compare the values in the dataframe. The final output should have a third column with results like this:

data_out=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "New_Column":["Deloitte",'Accenture','Accenture','Accenture','Ernst & young','Ernst & young','Tata Consultancy Services','Deloitte']})
data_out

So, as you can see, I want the less frequent values to get an entry in a new column that holds the most frequently occurring value of their kind. This is where fuzzy logic is useful.
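
A minimal, standalone sketch (not part of the original question), assuming FuzzyWuzzy is installed, showing the kind of pairwise similarity scores it produces for these employer names:

from fuzzywuzzy import fuzz

pairs = [('Deloitte', 'Deloitte Uk'),
         ('Accenture', 'Accenture Solutions Ltd'),
         ('Ernst & young', 'EY')]

for a, b in pairs:
    # token_set_ratio ignores word order and extra tokens, so names
    # that only differ by a suffix such as 'Uk' or 'Ltd' score high
    print(a, '<->', b, fuzz.token_set_ratio(a, b))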


Tags: data, dataframe, logic, pd, usa, young, ltd
2 Answers

Most of the duplicated employers can easily be detected using fuzzy string matching. However, Ernst & young <-> EY are not actually similar strings, which is why I ignore that replacement here. This solution uses my library RapidFuzz, but you could achieve something similar with FuzzyWuzzy (it just requires a bit more code, since it does not have the extractIndices processor; a rough FuzzyWuzzy-based sketch is shown after the output below).

import pandas as pd
from rapidfuzz import process, utils

def add_deduped_employer_column(data):
    values = data.values.tolist()
    employers = [employer for employer, _ in values]

    # preprocess strings beforehand (lowercase + remove punctuation),
    # so this is not done multiple times
    processed_employers = [utils.default_process(employer)
        for employer in employers]
    deduped_employers = employers.copy()

    replaced = []
    for (i, (employer, processed_employer)) in enumerate(
            zip(employers, processed_employers)):
        # skip elements that already got replaced
        if i in replaced:
            continue

        duplicates = process.extractIndices(
            processed_employer, processed_employers[i+1:],
            processor=None, score_cutoff=90, limit=None)

        for (c, _) in duplicates:
            deduped_employers[i+c+1] = employer
            """
            by replacing the element with an empty string the index from
            extractIndices stays correct but it can be skipped a lot 
            faster, since the compared strings will have very different
            lengths
            """
            processed_employers[i+c+1] = ""
            replaced.append(i+c+1)

    data['New_Column'] = deduped_employers

data=pd.DataFrame({
    'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'],
    "Count":['140','120','50','45','30','20','10','5']})

add_deduped_employer_column(data)
print(data)

Which results in the following dataframe:

                    Employer Count                 New_Column
0                   Deloitte   140                   Deloitte
1                  Accenture   120                  Accenture
2    Accenture Solutions Ltd    50                  Accenture
3              Accenture USA    45                  Accenture
4              Ernst & young    30              Ernst & young
5                         EY    20                         EY
6  Tata Consultancy Services    10  Tata Consultancy Services
7                Deloitte Uk     5                   Deloitte
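
For completeness, here is a rough FuzzyWuzzy-based sketch of the same idea (an adaptation, not the answerer's code): since FuzzyWuzzy's process module does not return match indices, the inner comparison is written as an explicit loop, and fuzz.WRatio is used as the scorer (it lowercases and strips punctuation internally). The score_cutoff of 90 is an assumption and may need tuning to reproduce exactly the same grouping.

import pandas as pd
from fuzzywuzzy import fuzz

def add_deduped_employer_column_fw(data, score_cutoff=90):
    employers = data['Employer'].tolist()
    deduped = employers.copy()
    replaced = set()

    for i in range(len(employers)):
        if i in replaced:
            continue
        for j in range(i + 1, len(employers)):
            if j in replaced:
                continue
            # WRatio preprocesses both strings (lowercase, strip punctuation)
            if fuzz.WRatio(employers[i], employers[j]) >= score_cutoff:
                # the data is sorted by Count, so the first occurrence
                # is kept as the canonical name
                deduped[j] = employers[i]
                replaced.add(j)

    data['New_Column'] = deduped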

I have not used fuzzy matching, but the following may help.

Data

import pandas as pd
import numpy as np

df=pd.DataFrame({'Employer':['Accenture','Accenture Solutions Ltd','Accenture USA', 'hjk USA', 'Tata Consultancy Services']})
df

You did not explain why Tata keeps its full name, so I assume it is special and mask it:

m=df.Employer.str.contains('Tata')

Then I use np.where to replace anything after the first name:

# regex=True is needed in newer pandas for the pattern to be treated as a regex
df['New_Column']=np.where(m, df['Employer'], df['Employer'].str.replace(r'(\s+\D+)', '', regex=True))
df

Output

(image: the resulting dataframe with the Employer and New_Column columns)
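
For reference, a quick standalone check (not from the answer) of what that regular expression removes from one of the sample strings:

import re

# r'\s+\D+' greedily matches the first whitespace and all non-digit
# characters after it, i.e. everything after the first word
print(re.sub(r'\s+\D+', '', 'Accenture Solutions Ltd'))  # Accenture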
