Splitting a DataFrame string column into multiple columns

Posted 2024-06-09 00:16:39


I'm new to pandas. I've searched quite a bit for a solution to my problem but haven't found much help online.

I have a string column, shown below, that I want to convert into separate columns. I tried splitting it, but the output isn't what I need.

*-----------------------------------------------------------------------------*
|  Total Visitor                                                              |
*-----------------------------------------------------------------------------*
|  2x Adult, 1x Adult + Audio Guide                                           |
|  2x Adult, 2x Youth, 1x Children                                            | 
|  5x Adult + Audio Guide, 1x Children + Audio Guide, 1x Senior + Audio Guide |
*-----------------------------------------------------------------------------*

Here is the code I used to split the strings; it does not produce the expected output:

df = data["Total Visitor"].str.split(",", n = 1, expand = True)

After splitting, my expected output should look like the table below:

*------------------------------------------------------------------------------------------------------------------*
| Adult    | Adult + Audio Guide    | Youth    | Children    | Children + AG             | Senior + AG             |
*------------------------------------------------------------------------------------------------------------------*
| 2x Adult | 1x Adult + Audio Guide | -        | -           | -                         | -                       |
| 2x Adult | -                      | 2x Youth | 1x Children | -                         | -                       |
| -        | 5x Adult + Audio Guide | -        | -           | 1x Children + Audio Guide | 1x Senior + Audio Guide |
*------------------------------------------------------------------------------------------------------------------*

How can I do this? Any help or guidance would be appreciated.


2 Answers

Here is one way to do it using pandas methods:

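# Split on commas, stack into one item per row, and trim surrounding whitespace
dstack = df['Total Visitor'].str.split(',', expand=True).stack().str.strip().to_frame()
# Capture the label after the "<count>x " prefix to use as the column name
dstack['cols'] = dstack[0].str.extract(r'\d+x\s(.*)')
# Index by (row, label) and pivot the labels back out into columns
df_out = dstack.set_index('cols', append=True)[0].reset_index(level=1, drop=True).unstack()
df_out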

Output:

cols     Adult     Adult + Audio Guide     Children     Children + Audio Guide     Senior + Audio Guide     Youth
0     2x Adult  1x Adult + Audio Guide          NaN                        NaN                      NaN       NaN
1     2x Adult                     NaN  1x Children                        NaN                      NaN  2x Youth
2          NaN  5x Adult + Audio Guide          NaN  1x Children + Audio Guide  1x Senior + Audio Guide       NaN
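To make the intermediate steps easier to follow, here is a self-contained sketch of the same stack/unstack idea. The sample frame is rebuilt from the table in the question, and the names `pieces`, `labels` and `long` are only illustrative. The extra `.dropna()` is my own addition, on the assumption that some pandas versions keep the padding NaN values after `stack()`.

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "Total Visitor": [
        "2x Adult, 1x Adult + Audio Guide",
        "2x Adult, 2x Youth, 1x Children",
        "5x Adult + Audio Guide, 1x Children + Audio Guide, 1x Senior + Audio Guide",
    ]
})

# Step 1: split on commas (wide), then stack to one item per (row, position)
pieces = (
    df["Total Visitor"]
    .str.split(",", expand=True)
    .stack()
    .dropna()      # added here: discard padding from rows with fewer items
    .str.strip()
)

# Step 2: the label is whatever follows the "<count>x " prefix
labels = pieces.str.extract(r"\d+x\s(.*)", expand=False)

# Step 3: index by (row, label) and unstack the labels into columns
long = pd.DataFrame({"value": pieces, "cols": labels})
df_out = (
    long.set_index("cols", append=True)["value"]
    .reset_index(level=1, drop=True)   # drop the item-position level
    .unstack()
)
print(df_out)
```

Cells with no matching item stay as NaN; chain `.fillna('-')` onto the result if you want dashes as in the question.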

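The idea is to build a list of dictionaries whose keys are the items with the leading count and `x` removed by the regex `^\d+x\s+` (`^` is the start of the string, `\d+` is one or more digits, `\s+` is one or more whitespace characters), and pass that list to the DataFrame constructor: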

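import re
import pandas as pd

# Key: item without its "<count>x " prefix; value: the full item (raw string keeps the regex escapes intact)
L = [dict([(re.sub(r'^\d+x\s+', "", y), y) for y in x.split(', ')]) for x in df['Total Visitor']]

df = pd.DataFrame(L).fillna('-')
print(df)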
      Adult     Adult + Audio Guide     Youth     Children  \
0  2x Adult  1x Adult + Audio Guide         -            -   
1  2x Adult                       -  2x Youth  1x Children   
2         -  5x Adult + Audio Guide         -            -   

      Children + Audio Guide     Senior + Audio Guide  
0                          -                        -  
1                          -                        -  
2  1x Children + Audio Guide  1x Senior + Audio Guide  
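If you want something you can run from scratch, the dictionary approach could be wrapped up roughly as below. The function name `split_visitors` and its `fill` parameter are made up for this sketch. It also strips each item, so that irregular spacing after the commas does not end up in the column names.

```python
import re
import pandas as pd

# Matches the leading "<count>x " prefix, e.g. "2x "
COUNT_PREFIX = re.compile(r"^\d+x\s+")

def split_visitors(series, fill="-"):
    """Hypothetical helper: one column per visitor type, `fill` where absent."""
    records = []
    for text in series:
        items = [item.strip() for item in text.split(",")]
        # key: label without the count prefix, value: the full item
        records.append({COUNT_PREFIX.sub("", item): item for item in items})
    return pd.DataFrame(records, index=series.index).fillna(fill)

# Sample data rebuilt from the question's table
data = pd.DataFrame({
    "Total Visitor": [
        "2x Adult, 1x Adult + Audio Guide",
        "2x Adult, 2x Youth, 1x Children",
        "5x Adult + Audio Guide, 1x Children + Audio Guide, 1x Senior + Audio Guide",
    ]
})

out = split_visitors(data["Total Visitor"])
print(out)
```

Passing `index=series.index` keeps the original row labels, so the result can be joined back onto the source frame if needed.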

Another, similar idea is to split each item on `'x '` to get the column names used as the dict keys:

# Split only at the first 'x ' so a label that itself contains 'x ' stays whole
L = [dict([(y.split('x ', 1)[1], y) for y in x.split(', ')]) for x in df['Total Visitor']]

df = pd.DataFrame(L).fillna('-')
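One caveat when deriving labels by splitting on `'x '`: without a split limit, a label that itself contains `'x '` gets cut short. A small sketch of the difference follows; the label "Max Pass" and both function names are hypothetical and do not come from the question's data.

```python
def label_unlimited(item):
    # Splits at every 'x ', so only the text up to the next 'x ' survives
    return item.split("x ")[1]

def label_limited(item):
    # Splits at the first 'x ' only, keeping the rest of the label intact
    return item.split("x ", 1)[1]

item = "1x Max Pass"  # hypothetical label containing 'x '
print(label_unlimited(item))  # Ma
print(label_limited(item))    # Max Pass
```

For the labels in the question both variants give the same result, so this only matters if other ticket names show up later.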
