如何基于单元格中的多行跨每行创建多个dataframe列?

2024-04-19 15:46:20 发布

您现在位置:Python中文网/ 问答频道 /正文

您好,我正在尝试基于[comment]列单元格中的多行在我的数据帧中创建多个列。源数据是一个.csv文件

这是我的数据集样本

+---------+-----------------------------------------+
| id      | comment                                 | 
+---------+-----------------------------------------+
| 123ab12 | DATE: 2/3/21 10:23:42 AM CST            |  
|         | STAGE: 1                                | 
|         | SCORE: 2,321                            |  
|         | NAME: Sally                             |  
|         | HOBBY: Swimming                         |  
|         | NOTES: But she doesn't like: sun, fish  |
+---------+-----------------------------------------+
| 123ab12 | DATE: 4/3/21 8:15:20 AM CST             |  
|         | STAGE: 1                                |  
|         | SCORE: 500                              |  
|         | NAME: Tom                               |  
|         | HOBBY: Running                          |  
|         | AGE: 26                                 |  
|         | NOTES: He needs new pair of sport shoes |
+---------+-----------------------------------------+

这就是我想要的

+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| id      | date                   | stage | score | name  | hobby    | age | notes                            |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 2/3/21 10:23:42 AM CST | 1     | 2,321 | Sally | Swimming |     | But she doesn't like: sun, fish  |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+
| 123ab12 | 4/3/21 8:15:20 AM CST  | 1     | 500   | Tom   | Running  | 26  | He needs new pair of sport shoes |
+---------+------------------------+-------+-------+-------+----------+-----+----------------------------------+

请注意:

  • 某些注释可能会为AGE增加一行
  • 结肠:可能在[comment]列的NOTES中出现两次,例如NOTES: bla bla bla : further sentence
  • ID可以复制
  • 有不同的ID和数千行

我最初的想法是:

  • 不知何故,使用正则表达式使用\n之前的换行符NOTES:作为列分隔符(但有时出现的AGE行似乎把它弄乱了,或者我的大脑不工作了……)

非常感谢你的帮助。谢谢大家!


Tags: 数据nameidagedatecommentamstage
1条回答
网友
1楼 · 发布于 2024-04-19 15:46:20

您可以使用str.extract和带有命名捕获组的正则表达式直接将提取的数据捕获到具有相应组名的数据帧列中(请参见this answer以回答关于pandas split list into columns with regex的问题)

您可以使用注释列的固定部分(即标签和换行符)作为锚定,并使AGE: 部分成为可选部分

正则表达式:

DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(\n)?NOTES:(?P<notes>[\s\S]+)

说明:

  1. 每一列都是通过ANCHOR: (?P<groupname>[\s\S]+)\n格式的模式提取的
  • ANCHOR: -这只是您的纯文本标签,即DATE: STAGE:
  • (?P<groupname>-这将启动一个命名的捕获组。<groupname>直接成为数据帧列名
  • [\s\S]+-贪婪匹配任何一系列(至少一个)字符(包括换行符,请参见this answer
  1. 对于age列,我们需要一些更改,因为AGE: 锚可能存在或不存在:
  • [\s\S]+?-紧靠AGE: 锚之前的最后一个组被延迟地匹配,否则它将贪婪地包括匹配中后面的整个AGE: 部分
  • (?:AGE: )?-由于AGE: 锚本身可能存在或不存在,因此它被封装在可选的非捕获组中
  • (?P<age>[\s\S]*?)-与其他捕获组不同,age的命名捕获组允许为空
  • (?:\n)?-尾随的换行符当然也是可选的,不应该被捕获

总之,无论AGE:部分在那里(https://regex101.com/r/tn6ixo/2/)还是不在那里(https://regex101.com/r/tn6ixo/1/),这都会在字符串中找到匹配项

完整示例:

输入CSV文件(comments.csv):

id;comments
123ab12;"DATE: 2/3/21 10:23:42 AM CST
STAGE: 1
SCORE: 2,321
NAME: Sally
HOBBY: Swimming
NOTES: But she doesn't like: sun, fish"
123ab12;"DATE: 4/3/21 8:15:20 AM CST
STAGE: 1
SCORE: 500
NAME: Tom
HOBBY: Running
AGE: 26
NOTES: He needs new pair of sport shoes"

Python脚本:

import pandas as pd

df = pd.read_csv('comments.csv', delimiter=';')

ef =  df['comments'].str.extract('DATE: (?P<date>[\s\S]+)\nSTAGE: (?P<stage>[\s\S]+)\nSCORE: (?P<score>[\s\S]+)\nNAME: (?P<name>[\s\S]+)\nHOBBY: (?P<hobby>[\s\S]+?)\n(?:AGE: )?(?P<age>[\s\S]*?)(?:\n)?NOTES:(?P<notes>[\s\S]+)', expand=True)

结果:

                     date stage  score   name     hobby age                              notes
0  2/3/21 10:23:42 AM CST     1  2,321  Sally  Swimming        But she doesn't like: sun, fish
1   4/3/21 8:15:20 AM CST     1    500    Tom   Running  26   He needs new pair of sport shoes

补充

请注意,这将生成一个数据帧,其中所有列都有dtype: object。你可能想convert some of your columns,例如

ef[['stage', 'age']] = ef[['stage', 'age']].apply(pd.to_numeric)
ef['score'] = ef['score'].str.replace(',', '').astype(int)
ef[['name', 'hobby', 'notes']] = ef[['name', 'hobby', 'notes']].astype('string')
ef['date'] = pd.to_datetime(ef['date'])

注意,后一个命令将无法自动正确识别您的时区,因为CST是一个不明确的时区。相反,您将得到天真的时间戳

要添加您的时区信息,您可以,例如add a timezone with ^{}

import pytz

ef['date'] = ef['date'].apply(lambda x: x.replace(tzinfo=pytz.timezone('America/Chicago')))

您还可以create a DatetimeIndexlocalize or convert to your timezone

相关问题 更多 >