Creating a priority order in Python

3 answers

User

Answer 1 · edited 2024-06-09 23:18:14

Here is one possible solution. It may not be the most elegant, but it does the job.

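# define custom function to get the correct years
# (category1..category3 are assumed to be the regex patterns from the question,
#  and `file` the DataFrame holding the data)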
def find_best_year(df):

    cond1 = df['degree'].str.match(category1)
    cond2 = df['degree'].str.match(category2)
    cond3 = df['degree'].str.match(category3)
    
    if cond1.any():
        return df.loc[cond1, 'year_of_birth']
    elif cond2.any():
        return df.loc[cond2, 'year_of_birth']
    elif cond3.any():
        return df.loc[cond3, 'year_of_birth']
    else:
        raise ValueError("No condition was found.")


# create lookup table with best years
lookup_df = file\
    .groupby('person_id')\
    .apply(find_best_year)\
    .reset_index()\
    .drop(columns=['level_1'])
print(lookup_df)
#    person_id  year_of_birth
# 0          1           1977
# 1          2           1965
# 2          3           1983
# 3          4           1975


# desired output
file\
    .drop(columns=['year_of_birth'])\
    .merge(lookup_df, on='person_id', how='left')
#    person_id                      degree  degree_completion  year_of_birth
# 0          1                         PhD               2006           1977
# 1          1                         BSc               1999           1977
# 2          2                       Ph.D.               1995           1965
# 3          2                         MBA               2000           1965
# 4          2                        B.A.               1987           1965
# 5          3     Bachelor of Engineering               2005           1983
# 6          4                          AB               1997           1975
# 7          4  Doctor of Philosophy (PhD)               2003           1975
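For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The input frame is an assumption: the per-row years of birth are back-computed from the output tables in this thread and the ages used in answer 3, and the patterns are adapted from answer 3 with the dots escaped. Unlike the function above, this variant returns a single scalar per person, so the lookup table has exactly one row per person_id even when several degrees match the same category.

```python
import pandas as pd

# Hypothetical input: per-row years of birth are assumed, not taken from the question
file = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "degree": ["PhD", "BSc", "Ph.D.", "MBA", "B.A.",
               "Bachelor of Engineering", "AB", "Doctor of Philosophy (PhD)"],
    "degree_completion": [2006, 1999, 1995, 2000, 1987, 2005, 1997, 2003],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
})

# Assumed patterns in priority order: bachelor, doctorate, master
patterns = [
    r"B[a-z]*|AB|A\.B\.?|S\.B\.",
    r"P[a-z]*|Doctor of Philosophy",
    r"M[a-z]*",
]

def find_best_year(group):
    # Return one scalar per person: the year from the highest-priority match
    for pattern in patterns:
        matched = group["degree"].str.match(pattern)
        if matched.any():
            return group.loc[matched, "year_of_birth"].iloc[0]
    raise ValueError("No condition was found.")

# One row per person_id with the preferred year of birth
lookup = (
    file.groupby("person_id")[["degree", "year_of_birth"]]
    .apply(find_best_year)
    .rename("year_of_birth")
    .reset_index()
)

# Replace the per-row years with the preferred year for each person
result = file.drop(columns=["year_of_birth"]).merge(lookup, on="person_id", how="left")
print(result)
```

Selecting only the two needed columns before apply keeps the grouping column out of the function, which is a design choice of this sketch rather than something the answer requires.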

User

Answer 2 · edited 2024-06-09 23:18:14

Here is an idea, perhaps not the most elegant one (assuming your DataFrame is named df):

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")
def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

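# regex=True is needed in recent pandas versions for a compiled pattern and a callable repl
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)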
df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())
df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

Some explanation:

Step 1: Reorganize the regular expression a bit

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")

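I adjusted the patterns so that they (1) match the complete entry in the degree column, (2) cover a few more possibilities, and (3) escape the dot (.). You will most likely have to adjust them further. I also put each category in its own capture group and joined the groups with |.

Step 2: Create the degree_cat column (= the category of the respective degree)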
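To see why points (1) and (3) matter, here is a small sketch using only the standard re module. The patterns loose and strict are shortened, hypothetical variants made up for illustration; they are not the full pattern from the answer.

```python
import re

# Unescaped: "." matches any character, so "A.B" also accepts "AxB"
loose = re.compile(r"^(?:AB|A.B)$")
# Escaped: only a literal dot is accepted
strict = re.compile(r"^(?:AB|A\.B)$")

loose_hit = bool(loose.match("AxB"))        # True
strict_miss = bool(strict.match("AxB"))     # False
strict_hit = bool(strict.match("A.B"))      # True

# Without "$" a matching prefix is enough; with it the whole entry must match
prefix_hit = bool(re.match(r"^AB", "ABC"))      # True
full_miss = bool(re.match(r"^AB$", "ABC"))      # False

print(loose_hit, strict_miss, strict_hit, prefix_hit, full_miss)
```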

def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

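# regex=True is needed in recent pandas versions for a compiled pattern and a callable repl
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)

I used category as the repl function: it replaces each match with the number of its category. It is worth checking how it behaves on a few entries. The strip is only a precaution. For the sample, the resulting column looks like this: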

0    2
1    1
2    2
3    3
4    1
5    1
6    1
7    2
Name: degree_cat, dtype: int64
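The replacement function from step 2 can also be tried outside pandas with plain re.sub, which makes its behaviour easier to inspect. This is only a sketch on the sample degrees; the entry "Diploma" is a made-up value that matches no category.

```python
import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         r"(PhD|Ph\.D\.|Doctor.*)|"
                         r"(MBA|Master.*))$")

def category(match):
    # Return the number of the capture group that matched, as a string
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

degrees = ["PhD", "BSc", "Ph.D.", "MBA", "B.A.",
           "Bachelor of Engineering", "AB", "Doctor of Philosophy (PhD)"]
replaced = [re_category.sub(category, d.strip()) for d in degrees]
print(replaced)  # ['2', '1', '2', '3', '1', '1', '1', '2']

# An entry that matches no category is returned unchanged
unmatched = re_category.sub(category, "Diploma")
print(unmatched)  # Diploma
```

Because unmatched entries come back unchanged, the later conversion to int would fail on them, so any degree not covered by the pattern needs to be handled (or the pattern extended) first.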

Step 3: Select the desired year of birth

df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())

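Here df is sorted by the new column and grouped by person_id, and then the first item of year_of_birth in each group is taken (thanks to the sorting, this is the desired year). The result for your sample: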

   person_id  year_of_birth
0          1           1977
1          2           1965
2          3           1983
3          4           1975
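If you would rather not sort the whole frame, a possible alternative for step 3 is to pick the row with the smallest degree_cat per person via idxmin. This is a sketch, not part of the original answer; the per-row years of birth below are assumed values chosen to be consistent with the sample output.

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "degree_cat": [2, 1, 2, 3, 1, 1, 1, 2],
    # Hypothetical per-row years of birth
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
})

# Index label of the row with the lowest category number for each person
best_rows = df.groupby("person_id")["degree_cat"].idxmin()

# Keep only the key and the year from those rows
df_year_of_birth = (
    df.loc[best_rows, ["person_id", "year_of_birth"]].reset_index(drop=True)
)
print(df_year_of_birth)
```

idxmin returns the first occurrence of the minimum, so ties within a person resolve to the earliest row. With the sort-based version you can pass kind="stable" to sort_values if the tie order matters.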

Step 4: Replace the values in year_of_birth with the desired ones

df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

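The old year_of_birth column and the degree_cat column are dropped because they are no longer needed. The df_year_of_birth frame is then merged into df on person_id to recreate the year_of_birth column with the correct values.

Final result: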

   person_id                      degree  degree_completion  year_of_birth
0          1                         PhD               2006           1977
1          1                         BSc               1999           1977
2          2                       Ph.D.               1995           1965
3          2                         MBA               2000           1965
4          2                        B.A.               1987           1965
5          3     Bachelor of Engineering               2005           1983
6          4                          AB               1997           1975
7          4  Doctor of Philosophy (PhD)               2003           1975
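As an alternative to the drop-and-merge in step 4, the lookup table can be turned into a Series and mapped onto person_id, which overwrites the column in place and keeps the original row order and index. A minimal sketch, with assumed per-row input years:

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    # Hypothetical per-row years of birth before the correction
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
})
df_year_of_birth = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "year_of_birth": [1977, 1965, 1983, 1975],
})

# Series indexed by person_id, used as a lookup table
lookup = df_year_of_birth.set_index("person_id")["year_of_birth"]

# Overwrite every row with the preferred year for its person
df["year_of_birth"] = df["person_id"].map(lookup)
print(df)
```

A person_id missing from the lookup would end up as NaN here, whereas the inner merge in the answer would drop those rows instead.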

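User

Answer 3 · edited 2024-06-09 23:18:14

To apply the regular expressions, you can write a function (get_diploma) that tests them one by one. Ideally, order them from most to least likely (Bachelor first).

You can then group by person_id and find the row with the highest priority (the get_expected_age function).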

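import re
import pandas as pd

# raw strings with the dots escaped, so that "." only matches a literal dot
category1 = r"^B[a-z]*|AB|A\.B\.|A\.B|S\.B\."
category2 = r"^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = r"^M[a-z]*|Master[a-z]*"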

diplomas = {category1: 'Bachelor', category2: 'PhD', category3: 'Master'}
ages = {'PhD': 33, 'Master': 30, 'Bachelor': 22}


def get_diploma(s):
    # for first matching regexp, return diploma
    for k in diplomas:
        if re.match(k, s):
            return diplomas[k]
    return None  # no pattern matched

df['degree_standardized'] = pd.Categorical(df['degree'].map(get_diploma),
                                           ordered=True,
                                           categories=['Master', 'PhD', 'Bachelor'])
# map the age from the standardized degree. NB. this could be fused with the previous step.
df['expected_age'] = df['degree_standardized'].map(ages)

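def get_expected_age(d):
    # get degree with highest priority (the last category of the ordered Categorical)
    d = d.copy()  # work on a copy instead of mutating the group in place
    s = d.sort_values(by='degree_standardized').iloc[-1]
    d['year_of_birth'] = s['degree_completion'] - s['expected_age']
    return d

# group_keys=False keeps the original row index in the result
df.groupby('person_id', group_keys=False).apply(get_expected_age)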

Output:

   person_id                      degree  degree_completion  year_of_birth degree_standardized expected_age
0          1                         PhD               2006           1977                 PhD           33
1          1                         BSc               1999           1977            Bachelor           22
2          2                       Ph.D.               1995           1965                 PhD           33
3          2                         MBA               2000           1965              Master           30
4          2                        B.A.               1987           1965            Bachelor           22
5          3     Bachelor of Engineering               2005           1983            Bachelor           22
6          4                          AB               1997           1975            Bachelor           22
7          4  Doctor of Philosophy (PhD)               2003           1975                 PhD           33
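The priority logic in get_expected_age relies on how an ordered Categorical sorts: values follow the order of the categories list, not alphabetical order, so the last listed category (Bachelor) ends up last after sorting and is what iloc[-1] picks. A small sketch of just that behaviour, on three made-up values:

```python
import pandas as pd

degrees = pd.Series(pd.Categorical(
    ["PhD", "Bachelor", "Master"],
    categories=["Master", "PhD", "Bachelor"],  # lowest to highest priority
    ordered=True,
))

# Sorting follows the category order, not the alphabet
sorted_degrees = degrees.sort_values().tolist()
print(sorted_degrees)  # ['Master', 'PhD', 'Bachelor']

# The maximum is the last category that actually occurs
highest = degrees.max()
print(highest)  # Bachelor
```

To change the priority, reorder the categories list; the rest of the code can stay as it is.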
