Creating a priority order in Python

Posted 2024-06-09 23:18:14

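I have a large dataset that shows each person's degrees and the year each degree was completed. Each person also has a corresponding ID. I am trying to work out each person's year of birth from the year a degree was completed and the average age at which that kind of degree is completed. For the average ages, I assume a PhD is completed at 33, a master's degree at 30, and a bachelor's degree at 22.

The dataset looks like this: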

person_id   degree                       degree_completion   year_of_birth
1           PhD                          2006                1973
1           BSc                          1999                1977
2           Ph.D.                        1995                1962
2           MBA                          2000                1970
2           B.A.                         1987                1965
3           Bachelor of Engineering      2005                1983
4           AB                           1997                1975
4           Doctor of Philosophy (PhD)   2003                1970                          
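
The year_of_birth values in the table follow from subtracting the assumed completion age from degree_completion. A minimal sketch of that arithmetic, assuming the degree has already been normalized into a hypothetical `level` column (that column is not part of the real dataset):

```python
import pandas as pd

# Assumed average completion ages, as stated above.
assumed_age = {"PhD": 33, "Master": 30, "Bachelor": 22}

df = pd.DataFrame({
    "person_id": [1, 1],
    "level": ["PhD", "Bachelor"],      # hypothetical normalized degree level
    "degree_completion": [2006, 1999],
})

# Year of birth = completion year minus the assumed age for that level.
df["year_of_birth"] = df["degree_completion"] - df["level"].map(assumed_age)
print(df["year_of_birth"].tolist())  # [1973, 1977]
```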

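I have already built something that computes a year of birth for each row, but I don't know how to create a priority system that picks the correct year of birth, since different degrees of the same person can yield different years of birth. I want the following priority order: year of birth from the bachelor's degree > year of birth from the PhD > year of birth from the master's degree.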

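I have tried many approaches with the groupby function and the Categorical data type. There are also hundreds of different spellings of degrees in the dataset, so I have been relying on regular expressions both to compute the year of birth and to build the priority system. This is what I have so far, but I can't find a way to make the regexes work in it: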

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

file['edu_degree'] = pd.Categorical(file['edu_degree'], ordered=True, categories=[category1, category2, category3])

file.groupby('person_id')['edu_degree'].transform('max')
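
A likely reason this attempt does not work, shown as a small sketch with made-up sample values: pd.Categorical compares each value literally against the listed categories and does not interpret a category as a regular expression, so any value that is not exactly equal to a category becomes NaN.

```python
import pandas as pd

degrees = pd.Series(["PhD", "BSc", "AB"])
pattern = "^B[a-z]*|AB"  # treated as a plain label, not as a regex

cat = pd.Categorical(degrees, ordered=True, categories=[pattern, "PhD"])

# Only "PhD" equals one of the categories; the other values turn into NaN.
print(cat.isna().tolist())  # [False, True, True]
```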

This is the output I expect (the year of birth is replaced according to the priority):

person_id   degree                       degree_completion   year_of_birth
1           PhD                          2006                1977
1           BSc                          1999                1977
2           Ph.D.                        1995                1965
2           MBA                          2000                1965
2           B.A.                         1987                1965
3           Bachelor of Engineering      2005                1983
4           AB                           1997                1975
4           Doctor of Philosophy (PhD)   2003                1975                          

3 answers

Here is one possible solution. It may not be the most elegant, but it does the job:

# define custom function to get the correct years
def find_best_year(df):

    cond1 = df['degree'].str.match(category1)
    cond2 = df['degree'].str.match(category2)
    cond3 = df['degree'].str.match(category3)
    
    if cond1.any():
        return df.loc[cond1, 'year_of_birth']
    elif cond2.any():
        return df.loc[cond2, 'year_of_birth']
    elif cond3.any():
        return df.loc[cond3, 'year_of_birth']
    else:
        raise ValueError("No condition was found.")


# create lookup table with best years
lookup_df = file\
    .groupby('person_id')\
    .apply(find_best_year)\
    .reset_index()\
    .drop(columns=['level_1'])
print(lookup_df)
#    person_id  year_of_birth
# 0          1           1977
# 1          2           1965
# 2          3           1983
# 3          4           1975


# desired output
file\
    .drop(columns=['year_of_birth'])\
    .merge(lookup_df, on='person_id', how='left')
#    person_id                      degree  degree_completion  year_of_birth
# 0          1                         PhD               2006           1977
# 1          1                         BSc               1999           1977
# 2          2                       Ph.D.               1995           1965
# 3          2                         MBA               2000           1965
# 4          2                        B.A.               1987           1965
# 5          3     Bachelor of Engineering               2005           1983
# 6          4                          AB               1997           1975
# 7          4  Doctor of Philosophy (PhD)               2003           1975
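
One caveat with returning a whole slice per person: if someone has two degrees in the winning category, the lookup table gets two rows for that person and the merge duplicates rows. A hedged variant that returns a single value per person is sketched below. The escaped patterns, the helper name `first_matching_year`, and the shortened sample frame are assumptions made for illustration.

```python
import pandas as pd

# Same idea as the patterns in the question, with the dots escaped (an assumption).
category1 = r"^B[a-z]*|AB|A\.B\.?|S\.B\."
category2 = r"^P[a-z]*|Doctor of Philosophy"
category3 = r"^M[a-z]*|Master"

file = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2],
    "degree": ["PhD", "BSc", "Ph.D.", "MBA", "B.A."],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965],
})

def first_matching_year(group):
    # Try the patterns in priority order and return one scalar per person.
    for pattern in (category1, category2, category3):
        hits = group.loc[group["degree"].str.match(pattern), "year_of_birth"]
        if not hits.empty:
            return hits.iloc[0]
    raise ValueError("No known degree for this person.")

best = {pid: first_matching_year(g) for pid, g in file.groupby("person_id")}
file["year_of_birth"] = file["person_id"].map(best)
print(file["year_of_birth"].tolist())  # [1977, 1977, 1965, 1965, 1965]
```

Mapping a dictionary back onto person_id keeps the original row order and avoids the merge step.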

Here is an idea, probably not the most elegant either (assuming your frame is named df):

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")
def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

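# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)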
df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())
df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

Some explanation:

Step 1: Reorganize the regular expressions a little

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")

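I adjusted the pattern so that it (1) matches the complete entry in the degree column, (2) covers more variants, and (3) escapes the dots. You will most likely have to adjust it further. I also put each category in its own group and joined the groups with |.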

Step 2: Create the degree_cat column (the category of the corresponding degree)

def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

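# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)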

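I used category as the repl function; it essentially replaces each match with the number of its category. Try it out to see how it works. The strip is just a precaution. For the sample, the resulting column looks like this: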

0    2
1    1
2    2
3    3
4    1
5    1
6    1
7    2
Name: degree_cat, dtype: int64
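
To see how a callable repl behaves outside pandas, here is a small sketch using plain re.sub with the same pattern and function. The sample strings, including the unmatched "Diploma", are made up for illustration.

```python
import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         r"(PhD|Ph\.D\.|Doctor.*)|"
                         r"(MBA|Master.*))$")

def category(match):
    # Return the number of the first capture group that matched.
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

samples = ["BSc", "Ph.D.", "MBA", "Diploma"]
result = [re_category.sub(category, s) for s in samples]
print(result)  # ['1', '2', '3', 'Diploma']
```

Strings that do not match are left unchanged, so on real data the final astype(int) would raise for any degree the pattern does not cover. That is a quick way to find spellings that still need to be added.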

Step 3: Select the desired year of birth

df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())

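Here df is sorted by the new column and grouped by person_id, and then the first entry of year_of_birth in each group is taken (thanks to the sorting, that is the desired year). Result for your sample: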

   person_id  year_of_birth
0          1           1977
1          2           1965
2          3           1983
3          4           1975

Step 4: Replace the values in year_of_birth with the desired ones

df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

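This drops the old year_of_birth column and the degree_cat column, since they are no longer needed, and then merges df_year_of_birth onto df along person_id to recreate year_of_birth with the correct values.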

Final result:

   person_id                      degree  degree_completion  year_of_birth
0          1                         PhD               2006           1977
1          1                         BSc               1999           1977
2          2                       Ph.D.               1995           1965
3          2                         MBA               2000           1965
4          2                        B.A.               1987           1965
5          3     Bachelor of Engineering               2005           1983
6          4                          AB               1997           1975
7          4  Doctor of Philosophy (PhD)               2003           1975
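
Steps 3 and 4 could probably also be collapsed into a single assignment with transform. This is only a sketch under the assumption that degree_cat already exists; the small frame below is a made-up excerpt of the sample.

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2],
    "degree_cat": [2, 1, 2, 3, 1],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965],
})

# Sort by category, take the first year per person, and broadcast it to every row.
# The result keeps the original index labels, so the assignment realigns it.
df["year_of_birth"] = (
    df.sort_values("degree_cat", kind="stable")
      .groupby("person_id")["year_of_birth"]
      .transform("first")
)
print(df["year_of_birth"].tolist())  # [1977, 1977, 1965, 1965, 1965]
```

The degree_cat column can then be dropped separately if it is no longer needed.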

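To apply the regular expressions, you can write a function (get_diploma) that tests them one by one, ideally ordered from most to least likely (bachelor first).

You can then group by person_id and pick the row with the highest priority (the get_expected_age function).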

import re

import pandas as pd

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

diplomas = {category1: 'Bachelor', category2: 'PhD', category3: 'Master'}
ages = {'PhD': 33, 'Master': 30, 'Bachelor': 22}


def get_diploma(s):
    # for first matching regexp, return diploma
    for k in diplomas:
        if re.match(k, s):
            return diplomas[k]
    

        
df['degree_standardized'] = pd.Categorical(df['degree'].map(get_diploma),
                                           ordered=True,
                                           categories=['Master', 'PhD', 'Bachelor'])
# map the age from the standardized degree. NB. this could be fused with the previous step.
df['expected_age'] = df['degree_standardized'].map(ages)

def get_expected_age(d):
    # get degree with highest priority
    s = d.sort_values(by='degree_standardized').iloc[-1]
    d['year_of_birth'] = s['degree_completion']-s['expected_age']
    return d

df.groupby('person_id').apply(get_expected_age)

Output:

   person_id                      degree  degree_completion  year_of_birth degree_standardized expected_age
0          1                         PhD               2006           1977                 PhD           33
1          1                         BSc               1999           1977            Bachelor           22
2          2                       Ph.D.               1995           1965                 PhD           33
3          2                         MBA               2000           1965              Master           30
4          2                        B.A.               1987           1965            Bachelor           22
5          3     Bachelor of Engineering               2005           1983            Bachelor           22
6          4                          AB               1997           1975            Bachelor           22
7          4  Doctor of Philosophy (PhD)               2003           1975                 PhD           33
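
One thing to watch for on real data: a degree that matches none of the patterns gets a NaN category, and sort_values places NaN last by default, so iloc[-1] would pick that unmatched row. A hedged sketch that drops unmatched rows before choosing follows. The escaped patterns, the helper name `birth_year`, and the "Certificate" row are assumptions for illustration.

```python
import re
import pandas as pd

# Patterns in test order; dots escaped (an assumption, adjust to your data).
patterns = {
    r"B[a-z]*|AB|A\.B\.?|S\.B\.": "Bachelor",
    r"P[a-z]*|Doctor of Philosophy": "PhD",
    r"M[a-z]*|Master": "Master",
}
ages = {"PhD": 33, "Master": 30, "Bachelor": 22}

def get_diploma(s):
    for pattern, name in patterns.items():
        if re.match(pattern, s):
            return name
    return None  # unmatched degrees become NaN in the categorical

df = pd.DataFrame({
    "person_id": [1, 1, 2],
    "degree": ["BSc", "Certificate", "PhD"],
    "degree_completion": [1999, 2010, 2006],
})
df["degree_standardized"] = pd.Categorical(
    df["degree"].map(get_diploma), ordered=True,
    categories=["Master", "PhD", "Bachelor"],
)

def birth_year(group):
    # Ignore rows whose degree could not be classified.
    known = group.dropna(subset=["degree_standardized"])
    if known.empty:
        return float("nan")
    best = known.sort_values("degree_standardized").iloc[-1]
    return best["degree_completion"] - ages[best["degree_standardized"]]

years = {pid: birth_year(g) for pid, g in df.groupby("person_id")}
df["year_of_birth"] = df["person_id"].map(years)
print(df["year_of_birth"].tolist())  # [1977, 1977, 1973]
```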
