Creating new columns in a DataFrame based on an existing column

Posted 2024-05-19 01:07:45


I have a Python dictionary that looks like this:

ref_dict = {
"Company1" :["C1_Dev1","C1_Dev2","C1_Dev3","C1_Dev4","C1_Dev5",],
"Company2" :["C2_Dev1","C2_Dev2","C2_Dev3","C2_Dev4","C2_Dev5",],
"Company3" :["C3_Dev1","C3_Dev2","C3_Dev3","C3_Dev4","C3_Dev5",],
 }

I have a pandas DataFrame named df, and one of its columns looks like this:

    DESC_DETAIL
0   Probably task Company2 C2_Dev5
1   File system C3_Dev1
2   Weather subcutaneous Company2
3   Company1 Travesty C1_Dev3
4   Does not match anything 
...........

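My goal is to add two extra columns to this DataFrame, named COMPANY and DEVICE. The value in each row of the COMPANY column should be the company key from the dictionary if that key appears in DESC_DETAIL, or the company that owns the device if one of its devices appears in DESC_DETAIL. The value in the DEVICE column is simply the device string found in DESC_DETAIL. If no match is found, the corresponding cell is left empty (NaN). The final output should therefore look like this: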

     DESC_DETAIL                        COMPANY         DEVICE
 0   Probably task Company2 C2_Dev5     Company2        C2_Dev5
 1   File system C3_Dev1                Company3        C3_Dev1
 2   Weather subcutaneous Company2      Company2        NaN
 3   Company1 Travesty C1_Dev3          Company1        C1_Dev3
 4   Does not match anything            NaN             NaN

My attempt:

for key, value in ref_dict.items():
    df['COMPANY'] = df.apply(lambda row: key if row['DESC_DETAIL'].isin(key) else Nan, axis=1)

This is obviously wrong and does not work. How can I make it work?


2 answers

You can extract the values with str.extract and a regular-expression pattern:

import re

import pandas as pd

# one row per (company, device) pair: index = company, value = device
s = pd.Series(ref_dict).explode()

# extract company
df['COMPANY'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s.index.unique())})", flags=re.IGNORECASE)

# extract device
df['DEVICE'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s)})", flags=re.IGNORECASE)

# fill missing company values based on device
df['COMPANY'] = df['COMPANY'].fillna(
    df['DEVICE'].str.lower().map(dict(zip(s.str.lower(), s.index))))

df

Output:

                      DESC_DETAIL   COMPANY   DEVICE
0  Probably task Company2 C2_Dev5  Company2  C2_Dev5
1             File system C3_Dev1  Company3  C3_Dev1
2   Weather subcutaneous Company2  Company2      NaN
3       Company1 Travesty C1_Dev3  Company1  C1_Dev3
4         Does not match anything       NaN      NaN
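
For reference, the steps above can be put together into a self-contained sketch. It rebuilds the sample data from the question and makes two changes that are assumptions here, not part of the answer above: every name goes through re.escape in case it ever contains regex metacharacters, and expand=False is passed so that str.extract returns a Series. The helper name alternation is made up for illustration.

```python
import re

import pandas as pd

ref_dict = {
    "Company1": ["C1_Dev1", "C1_Dev2", "C1_Dev3", "C1_Dev4", "C1_Dev5"],
    "Company2": ["C2_Dev1", "C2_Dev2", "C2_Dev3", "C2_Dev4", "C2_Dev5"],
    "Company3": ["C3_Dev1", "C3_Dev2", "C3_Dev3", "C3_Dev4", "C3_Dev5"],
}

# sample rows copied from the question
df = pd.DataFrame({"DESC_DETAIL": [
    "Probably task Company2 C2_Dev5",
    "File system C3_Dev1",
    "Weather subcutaneous Company2",
    "Company1 Travesty C1_Dev3",
    "Does not match anything",
]})

# index = company, value = device (one row per pair)
s = pd.Series(ref_dict).explode()


def alternation(names):
    """Hypothetical helper: escape each name and join them into one capture group."""
    return "(" + "|".join(re.escape(str(n)) for n in names) + ")"


# expand=False makes str.extract return a Series for a single capture group
df["COMPANY"] = df["DESC_DETAIL"].str.extract(
    alternation(s.index.unique()), flags=re.IGNORECASE, expand=False)
df["DEVICE"] = df["DESC_DETAIL"].str.extract(
    alternation(s), flags=re.IGNORECASE, expand=False)

# where no company name was found, look the company up from the device
device_to_company = dict(zip(s.str.lower(), s.index))
df["COMPANY"] = df["COMPANY"].fillna(
    df["DEVICE"].str.lower().map(device_to_company))

print(df)
```

Because the match is case-insensitive, the extracted text keeps whatever casing appears in DESC_DETAIL, which may differ from the dictionary key.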

You also need a device-to-company dictionary, which you can easily build from ref_dict like this:

dev_to_company_dict = {dev: company for company, devs in ref_dict.items() for dev in devs}

With that in place, the rest is straightforward:

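import re
import numpy as np

# split on whitespace and keep the tokens that are known company / device names
df['COMPANY'] = df['DESC_DETAIL'].apply(lambda det: ''.join(set(re.split(r"\s+", det)).intersection(ref_dict.keys())))
df['COMPANY'] = df['COMPANY'].replace('', np.nan)  # assign back instead of inplace=True on a column selection
df['DEVICE'] = df['DESC_DETAIL'].apply(lambda det: ''.join(set(re.split(r"\s+", det)).intersection(dev_to_company_dict.keys())))
df['DEVICE'] = df['DEVICE'].replace('', np.nan)
# fill missing companies from the matched device
df['COMPANY'] = df['COMPANY'].fillna(df['DEVICE'].map(dev_to_company_dict))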

Output:

                       DESC_DETAIL   COMPANY     DEVICE
0   Probably task Company2 C2_Dev5  Company2    C2_Dev5
1   File system C3_Dev1             Company3    C3_Dev1
2   Weather subcutaneous Company2   Company2        NaN
3   Company1 Travesty C1_Dev3       Company1    C1_Dev3
4   Does not match anything              NaN        NaN
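
One thing to watch with the set-intersection approach: if a row contains two known names, ''.join concatenates them in arbitrary set order. Below is a minimal sketch of a variant that instead keeps the first matching token in order of appearance. The helper first_match and the extra sixth row are made up for illustration and are not part of the answer above. Matching is case-sensitive and on whole whitespace-separated tokens only.

```python
import numpy as np
import pandas as pd

ref_dict = {
    "Company1": ["C1_Dev1", "C1_Dev2", "C1_Dev3", "C1_Dev4", "C1_Dev5"],
    "Company2": ["C2_Dev1", "C2_Dev2", "C2_Dev3", "C2_Dev4", "C2_Dev5"],
    "Company3": ["C3_Dev1", "C3_Dev2", "C3_Dev3", "C3_Dev4", "C3_Dev5"],
}
dev_to_company_dict = {dev: company for company, devs in ref_dict.items() for dev in devs}


def first_match(text, vocabulary):
    """Hypothetical helper: first whitespace-separated token found in vocabulary, else NaN."""
    for token in text.split():
        if token in vocabulary:
            return token
    return np.nan


df = pd.DataFrame({"DESC_DETAIL": [
    "Probably task Company2 C2_Dev5",
    "File system C3_Dev1",
    "Weather subcutaneous Company2",
    "Company1 Travesty C1_Dev3",
    "Does not match anything",
    "Company1 and Company2 share C1_Dev3",  # hypothetical row with two company names
]})

df["COMPANY"] = df["DESC_DETAIL"].apply(first_match, args=(set(ref_dict),))
df["DEVICE"] = df["DESC_DETAIL"].apply(first_match, args=(set(dev_to_company_dict),))
# fill missing companies from the matched device
df["COMPANY"] = df["COMPANY"].fillna(df["DEVICE"].map(dev_to_company_dict))

print(df)
```

Whether the first match is the right one to keep depends on the data; this only makes the choice deterministic.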
