Split a DataFrame column containing lists into multiple columns

Published 2024-06-12 09:32:31


I know there are already many questions on this topic, but still:
My input, as a DataFrame:

 task                                            m_label
0  S101-10061  [Cecum Landmark, ICV, Comment, Appendiceal ori...
1  S101-10069  [Rectum RF, ICV, Cecum Landmark, TI, Comment, ...
2  S101-10078  [Appendiceal orifice, ICV, Cecum Landmark, Com...
3  S101-10088  [Cecum Landmark, ICV, Comment, Appendiceal ori...
4  S101-10100  [Transverse, Appendiceal orifice, ICV, Cecum L...
5  S101-10102  [Rectum RF, ICV, Cecum Landmark, Comment, TI, ...
6  S101-10133  [Rectum RF, Transverse, ICV, Cecum Landmark, C...
7  S101YGBgZ2                                          [Comment]

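I would like to split it with something like df.m_label.str.split("",expand=True), but that returns NaN. Maybe something is wrong with df? I got it from a pandas Series: m_lab_task=data.groupby(['task'])['m_label'].unique(). So perhaps the mistake is in that earlier step?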

Desired output:

      task       m_label1 m_label2 m_label3 m_label4 m_label5 m_label6
0   S101-10061  Cecum Landmark ICV Comment Appendiceal orifice
1   S101-10069  Rectum RF ICV Cecum Landmark TI Comment Transverse
2   S101-10078  Appendiceal orifice ICV Cecum Landmark Comment Transverse Rectum RF

3 Answers

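When you build a DataFrame from lists of strings, items that are not separated get merged into a single string. To avoid this, separate the list items with commas before converting to a DataFrame, as shown below: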

import pandas as pd
data={"task":["S101-10061","S101-10069","S101-10078","S101-10088","S101-10100","S101-10102","S101-10133","S101YGBgZ2"],
     "m_label":[['Cecum Landmark','ICV' ,'Comment' ,'Appendiceal orifice'],['Rectum RF','ICV','Cecum Landmark','TI','Comment','Transverse']
               ,['Appendiceal orifice' ,'ICV' ,'Cecum Landmark', 'Comment', 'Transverse','Rectum RF'],['Cecum Landmark', 'ICV', 'Comment', 'Appendiceal orifice'],
               ['Transverse' ,'Appendiceal orifice', 'ICV', 'Cecum Landmark', 'Comment'],['Rectum RF' ,'ICV' ,'Cecum Landmark', 'Comment' ,'TI' ,'Transverse','Appendiceal orifice'],
               ['Rectum RF', 'Transverse' ,'ICV' ,'Cecum Landmark', 'Comment'],['Comment']]}
data=pd.DataFrame(data)

The DataFrame should look like this:

        task    m_label
0   S101-10061  [Cecum Landmark, ICV, Comment, Appendiceal ori...
1   S101-10069  [Rectum RF, ICV, Cecum Landmark, TI, Comment, ...
2   S101-10078  [Appendiceal orifice, ICV, Cecum Landmark, Com...
3   S101-10088  [Cecum Landmark, ICV, Comment, Appendiceal ori...
4   S101-10100  [Transverse, Appendiceal orifice, ICV, Cecum L...
5   S101-10102  [Rectum RF, ICV, Cecum Landmark, Comment, TI, ...
6   S101-10133  [Rectum RF, Transverse, ICV, Cecum Landmark, C...
7   S101YGBgZ2  [Comment]

Code to produce the output:

import numpy as np
data=pd.concat([data["task"],data["m_label"].apply(lambda x:pd.Series(x).add_prefix("m_label"))],axis=1).replace(np.nan," ")

   task        m_label0             m_label1             m_label2        m_label3             m_label4    m_label5    m_label6
0  S101-10061  Cecum Landmark       ICV                  Comment         Appendiceal orifice
1  S101-10069  Rectum RF            ICV                  Cecum Landmark  TI                   Comment     Transverse
2  S101-10078  Appendiceal orifice  ICV                  Cecum Landmark  Comment              Transverse  Rectum RF
3  S101-10088  Cecum Landmark       ICV                  Comment         Appendiceal orifice
4  S101-10100  Transverse           Appendiceal orifice  ICV             Cecum Landmark       Comment
5  S101-10102  Rectum RF            ICV                  Cecum Landmark  Comment              TI          Transverse  Appendiceal orifice
6  S101-10133  Rectum RF            Transverse           ICV             Cecum Landmark       Comment
7  S101YGBgZ2  Comment
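
If m_label really holds Python lists, as in the data built above, a possible alternative to apply(pd.Series) is to hand the lists straight to the DataFrame constructor. This is only a sketch on a small stand-in frame (the three rows and the names wide and result are made up for illustration, not the asker's full data), and it numbers the columns from 1 to match the desired output in the question:

```python
import pandas as pd

# Small stand-in frame; values are illustrative, not the asker's full data.
df = pd.DataFrame({
    "task": ["S101-10061", "S101-10069", "S101YGBgZ2"],
    "m_label": [
        ["Cecum Landmark", "ICV", "Comment"],
        ["Rectum RF", "ICV"],
        ["Comment"],
    ],
})

# Build the wide frame directly from the lists; shorter rows are padded with missing values.
wide = pd.DataFrame(df["m_label"].tolist(), index=df.index)
wide.columns = [f"m_label{i + 1}" for i in range(wide.shape[1])]

# Put 'task' back in front and blank out the padding.
result = pd.concat([df["task"], wide], axis=1).fillna("")
print(result)
```

On larger frames this usually avoids the per-row overhead of creating one pd.Series per list, though I have not timed it on the asker's data.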

Use str.findall with a regular expression that captures everything enclosed in single quotes (''), then apply pd.Series to turn the matches into columns:

df=df.set_index('task')['m_label'].str.findall('\'(.*?)\'').apply(pd.Series)
df.columns = [f'm_label{i+1}' for i in df]

Output:

                       m_label1             m_label2        m_label3               m_label4    m_label5    m_label6             m_label7  
task                                                                                                                                       
S101-10061       Cecum Landmark                  ICV         Comment    Appendiceal orifice         NaN         NaN                  NaN   
S101-10069            Rectum RF                  ICV  Cecum Landmark                     TI     Comment  Transverse                  NaN   
S101-10078  Appendiceal orifice                  ICV  Cecum Landmark                Comment  Transverse   Rectum RF                  NaN   
S101-10088       Cecum Landmark                  ICV         Comment    Appendiceal orifice         NaN         NaN                  NaN   
S101-10100           Transverse  Appendiceal orifice             ICV         Cecum Landmark     Comment         NaN                  NaN   
S101-10102            Rectum RF                  ICV  Cecum Landmark                Comment          TI  Transverse  Appendiceal orifice   
S101-10133            Rectum RF           Transverse             ICV         Cecum Landmark     Comment         NaN                  NaN   
S101YGBgZ2              Comment                  NaN             NaN                    NaN         NaN         NaN                  NaN   
                

If needed, you can reset the index afterwards and then call fillna('').
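
A minimal sketch of that last step, assuming a wide frame shaped like the output above. The two-row frame and the names wide and flat are hypothetical and only stand in for the real result:

```python
import pandas as pd

# Hypothetical wide frame shaped like the output above (index = task, missing cells as padding).
wide = pd.DataFrame(
    {"m_label1": ["Cecum Landmark", "Comment"], "m_label2": ["ICV", None]},
    index=pd.Index(["S101-10061", "S101YGBgZ2"], name="task"),
)

# Move 'task' back into a regular column, then replace missing cells with empty strings.
flat = wide.reset_index().fillna("")
print(flat)
```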

To add something to pyguy's answer: if you want to rename the columns "on the fly", you can use add_prefix():

df.set_index('task')['m_label'].str.findall('\'(.*?)\'').apply(pd.Series).add_prefix('m_label')

Output:

Out[27]: 
                  m_label0 m_label1  ... m_label4    m_label5
task                                 ...                     
S101-10061  Cecum Landmark      ICV  ...      NaN         NaN
S101-10069       Rectum RF      ICV  ...  Comment  Transverse
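
One caveat worth checking: the regex approach needs m_label to contain strings. If the column holds actual list objects, the .str methods return NaN for every row, which looks like the symptom described in the question. A sketch of one workaround, assuming real lists and casting them to their string form first (the two-row frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "task": ["S101-10061", "S101YGBgZ2"],
    "m_label": [["Cecum Landmark", "ICV"], ["Comment"]],  # real lists, not strings
})

# On list elements the .str methods give NaN for every row.
all_nan = df["m_label"].str.findall(r"'(.*?)'").isna().all()
print(all_nan)

# Casting to str first yields text like "['Cecum Landmark', 'ICV']", which the regex can parse.
out = (
    df.set_index("task")["m_label"]
    .astype(str)
    .str.findall(r"'(.*?)'")
    .apply(pd.Series)
    .add_prefix("m_label")
)
print(out)
```

This relies on the list's printed form using single quotes, so a label that itself contains an apostrophe would not be captured correctly.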
