基于条件创建具有新列名的数据透视表

2024-05-16 00:21:27 发布

您现在位置:Python中文网/ 问答频道 /正文

假设我有一个地址数据框,比如

  user_id      AddressLine1     AddressLine2       City      address_type
0 u68472      PO Box 354         None             LOREDO      Mailing
1 u68472      154 Cedar Park     Apt. 39          LOREDO      Residential

我想创建一个新的df,每个用户id只有一行,但有新的列名指示是住宅还是邮寄(这将只保留相同的现有列名)。所以它看起来像:

user_id    AddressLine1 AddressLine2  City    RES_AddressLine1   RES_AddressLine2  Res_City
0 u68472    PO Box 354  None          LOREDO  154 Cedar Park     Apt. 39           LOREDO

真的不知道该怎么做。对熊猫来说是新的。谢谢


Tags: boxnoneidcitypark地址aptres
2条回答

您可以基于地址类型创建两个数据帧,然后在用户id上连接它们,并为重叠列设置rsuffix

mailing = df.loc[df.address_type == 'Mailing', :].set_index('user_id')

residential = df.loc[df.address_type == 'Residential', :].set_index('user_id')

new_df = mailing.join(residential, rsuffix='_RES')

输出:

        AddressLine1 AddressLine2    City address_type AddressLine1_RES AddressLine2_RES City_RES address_type_RES
user_id                                                                                                           
u68472    PO Box 354         None  LOREDO      Mailing   154 Cedar Park          Apt. 39   LOREDO      Residential

或者,您可以使用多级索引:

Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.10.2   An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([['u68472','PO Box 354',None,'LOREDO','Mailing'],^M
   ...:                   ['u68472','154 Cedar Park','Apt. 39','LOREDO','Residential']], ^M
   ...:                  columns = ['user_id','AddressLine1','AddressLine2','City','address_type'])

In [3]: df
Out[3]:
  user_id    AddressLine1 AddressLine2    City address_type
0  u68472      PO Box 354         None  LOREDO      Mailing
1  u68472  154 Cedar Park      Apt. 39  LOREDO  Residential

In [4]: new_df = df.set_index(['user_id', 'address_type']).unstack('address_type')

In [5]: new_df
Out[5]:
             AddressLine1                 AddressLine2                City
address_type      Mailing     Residential      Mailing Residential Mailing Residential
user_id
u68472         PO Box 354  154 Cedar Park         None     Apt. 39  LOREDO      LOREDO

In [6]: new_df.columns = new_df.columns.get_level_values(0).str.cat(new_df.columns.get_level_values('address_type').values, sep='_')

In [7]: new_df
Out[7]:
        AddressLine1_Mailing AddressLine1_Residential AddressLine2_Mailing AddressLine2_Residential City_Mailing City_Residential
user_id
u68472            PO Box 354           154 Cedar Park                 None                  Apt. 39       LOREDO           LOREDO

相关问题 更多 >