Using np.where to create a new column in a pandas df based on conditions

Published 2024-06-16 14:44:16

I am trying to create a flag variable (that is, a new column with binary values, such as 1 for True and 0 for False). I tried np.where (as per this post) and {}, but neither worked.

With df.where:

df.where(((df['MOSL_Rating'] == 'Highly Effective') & (df['MOTP_Rating'] == 'Developing')) | ((df['MOSL_Rating'] == 'Highly Effective') & (df['MOTP_Rating'] == 'Ineffective')) | ((df['MOSL_Rating'] == 'Effective') & (df['MOTP_Rating'] == 'Ineffective')) | ((df['MOSL_Rating'] == 'Ineffective') & (df['MOTP_Rating'] == 'Highly Effective')) | ((df['MOSL_Rating'] == 'Ineffective') & (df['MOTP_Rating'] == 'Effective')) | ((df['MOSL_Rating'] == 'Developing') & (df['MOTP_Rating'] == 'Highly Effective')), df['disp_rating'], 1, axis=1)

But this returns ValueError: For argument "inplace" expected type bool, received type int.

If I change the code from df['disp_rating'], 1, axis=1 to df['disp_rating'], True, axis=1, it returns TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
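
One possible reading of both errors, assuming a pandas version in which `DataFrame.where` accepts `inplace` as its third positional parameter: the `1` (and later `True`) is taken as `inplace`, not as a replacement value. Note also that `where` keeps values where the condition is True and replaces the rest. A minimal sketch on a made-up three-row frame, using only one of the six conditions and passing `other` by keyword:

```python
import pandas as pd

# Hypothetical sample data, not the asker's real frame.
df = pd.DataFrame({
    "MOSL_Rating": ["Highly Effective", "Effective", "Developing"],
    "MOTP_Rating": ["Developing", "Effective", "Developing"],
})

# One of the six conditions from the question, for brevity.
mask = (df["MOSL_Rating"] == "Highly Effective") & (df["MOTP_Rating"] == "Developing")

# Start from a column of ones; where() keeps a value where mask is True
# and substitutes `other` where mask is False.
df["disp_rating"] = pd.Series(1, index=df.index).where(mask, other=0)
```

Passing `other=` explicitly avoids any ambiguity about which parameter a positional value lands in.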

I also tried np.where, but it returns {}

I have also read this question, which looks similar. However, when I use the solution provided there, it returns: KeyError: 'disp_rating'

If I create the variable in advance (to avoid the KeyError), I just get another error about something else.
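
The KeyError is consistent with reading a column that does not exist yet, for example by passing `df['disp_rating']` as an argument before the column has been created. The sketch below, on a hypothetical one-row frame, illustrates the difference between reading and assigning:

```python
import pandas as pd

# Hypothetical one-row frame without a disp_rating column.
df = pd.DataFrame({"MOSL_Rating": ["Effective"], "MOTP_Rating": ["Ineffective"]})

try:
    df["disp_rating"]      # reading a missing column raises KeyError
    read_failed = False
except KeyError:
    read_failed = True

df["disp_rating"] = 0      # assigning to a missing column creates it
```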

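I thought creating a new variable based on a few basic conditions would be very simple, but I have been stuck on this for a while and have not made any real progress, despite reading the documentation and many SO posts.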

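Edit: To be more explicit, I am trying to create a new column (named "disp_rating") based on whether the values in two other columns of the same df ("MOSL_Rating" and "MOTP_Rating") meet certain conditions. I only have one DataFrame, so I am not trying to compare two DataFrames. In SQL I would use a CASE WHEN statement; in SAS I would use an IF/THEN/ELSE statement.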
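
A rough pandas analogue of CASE WHEN for this goal could look like the following sketch. The sample rows are invented for illustration, and the six combinations from the question are grouped by MOSL value with `isin` to keep the expression short:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real frame has more rows and columns.
df = pd.DataFrame({
    "MOSL_Rating": ["Highly Effective", "Effective", "Ineffective", "Developing", "Effective"],
    "MOTP_Rating": ["Developing", "Ineffective", "Effective", "Developing", "Effective"],
})

mosl, motp = df["MOSL_Rating"], df["MOTP_Rating"]

# The six (MOSL, MOTP) combinations listed in the question, grouped by MOSL value.
mismatch = (
    ((mosl == "Highly Effective") & motp.isin(["Developing", "Ineffective"]))
    | ((mosl == "Effective") & (motp == "Ineffective"))
    | ((mosl == "Ineffective") & motp.isin(["Highly Effective", "Effective"]))
    | ((mosl == "Developing") & (motp == "Highly Effective"))
)

# Equivalent in spirit to: CASE WHEN mismatch THEN 1 ELSE 0 END
df["disp_rating"] = np.where(mismatch, 1, 0)
```

Because the result is assigned to `df["disp_rating"]`, the column does not need to exist beforehand.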

My df generally looks like this:

^{pr2}$

2 Answers

I could not find out why where does not work, but here is a workaround:

Code to create the df:

def make_row():
    import random
    dico = {"MOSL_Rating" : ['Highly Effective', 'Effective', 'Ineffective', 'Developing'],
            "MOTP_Rating" : ['Developing', 'Ineffective', 'Highly Effective', 'Effective', 'Highly Effective'],
           "disp_rating" : range(100)}

    row = {}
    for k in dico.keys():
        v = random.choice(dico[k])
        row[k] =v
    return row

def make_df(nb_row):
    import pandas as pd
    rows = [make_row() for i in range(nb_row)]
    return pd.DataFrame(rows)

I can create the df:

^{pr2}$

And a second one:

df2 = make_df(3)
df2
    MOSL_Rating MOTP_Rating disp_rating
0   Effective   Highly Effective    24
1   Effective   Developing  38
2   Highly Effective    Ineffective 16

Then I build your tests:

MOSL_high_efective   = df['MOSL_Rating'] == 'Highly Effective'
MOSL_efective        = df['MOSL_Rating'] == 'Effective'
MOSL_inefective      = df['MOSL_Rating'] == 'Ineffective'
MOSL_developing      = df['MOSL_Rating'] == 'Developing'

MOTP_high_efective   = df['MOTP_Rating'] == 'Highly Effective'
MOTP_efective        = df['MOTP_Rating'] == 'Effective'
MOTP_inefective      = df['MOTP_Rating'] == 'Ineffective'
MOTP_developing      = df['MOTP_Rating'] == 'Developing'

test1 = MOSL_high_efective & MOTP_developing
test2 = MOSL_high_efective & MOTP_inefective
test3 = MOSL_efective      & MOTP_inefective
test4 = MOSL_inefective    & MOTP_high_efective
test5 = MOSL_inefective    & MOTP_efective
test6 = MOSL_developing    & MOTP_high_efective

conditions  = test1 | test2 |  test3 | test4 | test5 | test6

Then, where the conditions are met, update the values of the first DataFrame with those of the second:

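# Index labels of the rows that satisfy the conditions
lines_to_be_updates = df.loc[conditions].index.values
# Select rows by label with .loc; df2[lines_to_be_updates] would look up columns instead
df.loc[lines_to_be_updates, "disp_rating"] = df2.loc[lines_to_be_updates, "disp_rating"]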

df
    MOSL_Rating MOTP_Rating disp_rating
0   Highly Effective    Ineffective 24
1   Highly Effective    Highly Effective    71
2   Effective   Ineffective 16
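
If the goal is only the 0/1 flag described in the question, a second DataFrame should not be needed. The sketch below assumes a boolean Series like `conditions` above; here it is rebuilt from just two of the six tests on hypothetical data so the example runs on its own:

```python
import pandas as pd

# Hypothetical rows mirroring the sample output above.
df = pd.DataFrame({
    "MOSL_Rating": ["Highly Effective", "Highly Effective", "Effective"],
    "MOTP_Rating": ["Ineffective", "Highly Effective", "Ineffective"],
})

# Stand-in for the `conditions` Series; only two of the six tests are rebuilt here.
test2 = (df["MOSL_Rating"] == "Highly Effective") & (df["MOTP_Rating"] == "Ineffective")
test3 = (df["MOSL_Rating"] == "Effective") & (df["MOTP_Rating"] == "Ineffective")
conditions = test2 | test3

df["disp_rating"] = 0                   # default for every row
df.loc[conditions, "disp_rating"] = 1   # overwrite only the matching rows
```

An equivalent one-liner would be `df["disp_rating"] = conditions.astype(int)`.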

Your logic is overly complicated and can be simplified/optimized with set. Here is a demonstration.

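import numpy as np

# Each frozenset is an unordered pair of ratings that counts as a mismatch
d = {frozenset({'H', 'D'}),
     frozenset({'H', 'I'}),
     frozenset({'E', 'I'})}

# Collapse the two rating columns into one unordered pair per row
df['MOSL_MOTP'] = list(map(frozenset, zip(df['MOSL_Rating'], df['MOTP_Rating'])))
# Flag rows whose pair is in d
df['Result'] = np.where(df['MOSL_MOTP'].isin(d), 1, 0)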

#    ID  Loc MOSL_Rating MOTP_Rating MOSL_MOTP  Result
# 0  12  54X           D           E    (E, D)       0
# 1  45  86I           D           I    (D, I)       0
# 2  98  65R           H           H       (H)       0
# 3  95  66R           H           D    (D, H)       1
# 4  96  67R           D           H    (D, H)       1
# 5  97  68R           E           I    (E, I)       1
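
The same idea can be sketched with the full rating names from the question and without the intermediate column. The sample rows are invented, and a plain membership test in a list comprehension is used in place of `isin`; this is one possible variant, not the answer's exact method:

```python
import pandas as pd

# Unordered pairs that count as a mismatch; a frozenset ignores
# which column holds which value.
mismatched = {
    frozenset({"Highly Effective", "Developing"}),
    frozenset({"Highly Effective", "Ineffective"}),
    frozenset({"Effective", "Ineffective"}),
}

# Hypothetical sample rows.
df = pd.DataFrame({
    "MOSL_Rating": ["Developing", "Highly Effective", "Developing", "Effective"],
    "MOTP_Rating": ["Effective", "Highly Effective", "Highly Effective", "Ineffective"],
})

# 1 when the row's pair of ratings is one of the mismatched pairs, else 0.
df["disp_rating"] = [
    int(frozenset(pair) in mismatched)
    for pair in zip(df["MOSL_Rating"], df["MOTP_Rating"])
]
```

Three sets cover all six ordered combinations because each pairing appears in the question in both orders.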
