Split/slice column values (text data) into multiple rows

Posted 2024-06-10 11:06:45

I have the following DataFrame:

data = [
    ['ticket_1', '''2021-04-18 11:11:23 - Unknown
Part replaced

2021-04-18 09:03:10 - John Doe
Requires part change. Technician enroute

2021-04-17 19:30:25 - John Doe
Working on the issue'''], 
    ['ticket_2','''2021-04-17 19:13:09 - Unknown_2
Transferring to other resource

2021-04-17 19:12:34 - Unknown_2
Issue assigned''']
]

original dataframe

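I want to split the comments column and perform the following steps:

1. Extract the datetime and put it in a separate column.

2. Put the string that follows the datetime in another column.

3. Repeat the ticket number once for each datetime found in the comments column.

For example: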

data_1 = [
    ['ticket_1','''2021-04-18 11:11:23''','''Unknown
Part replaced'''],
    ['ticket_1', '''2021-04-18 09:03:10''', '''John Doe
Requires part change. Technician enroute'''],
    ['ticket_1','''2021-04-17 19:30:25''', '''John Doe
Working on the issue'''],
    ['ticket_2','''2021-04-17 19:13:09''','''Unknown_2
Transferring to other resource'''],
    ['ticket_2','''2021-04-17 19:12:34''','''Unknown_2
Issue assigned''']
]

desired output

I have tried combinations of regex, explode, split, and slice, but without success.


1 answer
User
Answer 1 · posted 2024-06-10 11:06:45

Assuming the DataFrame has the following structure:

  ticket_num                                                                                                                                                                      comments
0   ticket_1  2021-04-18 11:11:23 - Unknown\nPart replaced\n2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n2021-04-17 19:30:25 - John Doe\nWorking on the issue
1   ticket_2                                                              2021-04-17 19:13:09 - Unknown_2\nTransferring to other resource\n2021-04-17 19:12:34 - Unknown_2\nIssue assigned
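
A frame with this structure can be built from a list like the one in the question. The sketch below is only an illustration: the column names `ticket_num` and `comments` are taken from the printed frame above, and the comment entries are joined with single newlines, as in that printout, rather than with the blank lines shown in the question.

```python
import pandas as pd

# Rows mirror the printed frame above; entries are separated by a single '\n'.
data = [
    ['ticket_1', '2021-04-18 11:11:23 - Unknown\nPart replaced\n'
                 '2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n'
                 '2021-04-17 19:30:25 - John Doe\nWorking on the issue'],
    ['ticket_2', '2021-04-17 19:13:09 - Unknown_2\nTransferring to other resource\n'
                 '2021-04-17 19:12:34 - Unknown_2\nIssue assigned'],
]

# Column names are assumed from the printed structure, not given in the question.
df = pd.DataFrame(data, columns=['ticket_num', 'comments'])
```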

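First, use Series.str.findall with a suitable regular expression to find every occurrence in the comments column of a datetime value followed by any characters up to the next datetime value or the end of the multi-line string (\Z). The re.DOTALL flag is also needed so that . matches newlines in the multi-line string. Then explode the DataFrame on this new column, turn the datetime and comment values into two separate columns by applying pandas Series to each tuple, and concatenate along axis=1: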

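import re
import pandas as pd

# Capture each timestamp plus everything up to the next timestamp (or end of string).
df['extracted'] = (df['comments']
                    .str
                    .findall(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*?(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}|\Z))',
                            flags=re.DOTALL)
                   )
# One row per (timestamp, text) tuple.
df = df.explode('extracted')
# Expand each tuple into 'datetime' and 'comment' columns and attach them.
df = pd.concat([df, df['extracted'].apply(lambda x: pd.Series(x, index=['datetime', 'comment']))], axis=1)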
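
To see what the pattern captures, it may help to run it on a single string with plain `re.findall`, outside pandas. This is a minimal sketch: the names `TS`, `pattern`, and `sample` are made up for illustration, and the sample text is the ticket_2 comment from above.

```python
import re

# Same pattern as in the answer, with the timestamp part factored out.
# Group 1 is the timestamp; group 2 is everything up to the next timestamp
# or the end of the string (\Z).
TS = r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}'
pattern = re.compile(rf'({TS})(.*?(?={TS}|\Z))', flags=re.DOTALL)

sample = ('2021-04-17 19:13:09 - Unknown_2\nTransferring to other resource\n'
          '2021-04-17 19:12:34 - Unknown_2\nIssue assigned')

# findall returns one (timestamp, rest) tuple per log entry.
matches = pattern.findall(sample)
for timestamp, rest in matches:
    print(repr(timestamp), repr(rest))
```

Each tuple still carries the leading " - " and any trailing newline in its second element, which is why the comment column needs a strip afterwards.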

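Now you can drop the columns you no longer need, strip any leading or trailing hyphens and whitespace from the new comment column, and, if needed, convert the datetime column to a datetime dtype.

Output: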

>>> df[['ticket_num', 'datetime',  'comment']]

  ticket_num             datetime                                                comment
0   ticket_1  2021-04-18 11:11:23                             - Unknown\nPart replaced\n 
0   ticket_1  2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n 
0   ticket_1  2021-04-17 19:30:25                       - John Doe\nWorking on the issue 
1   ticket_2  2021-04-17 19:13:09          - Unknown_2\nTransferring to other resource\n 
1   ticket_2  2021-04-17 19:12:34                            - Unknown_2\nIssue assigned 
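
The cleanup mentioned earlier (dropping helper columns, stripping hyphens and whitespace, converting the dtype) could look roughly like the sketch below. The small frame is a stand-in copied from the ticket_2 rows of the output so that the snippet runs on its own, and `result` is just an illustrative name.

```python
import pandas as pd

# Stand-in for the exploded frame; values copied from the ticket_2 rows above.
df = pd.DataFrame({
    'ticket_num': ['ticket_2', 'ticket_2'],
    'comments': ['raw text', 'raw text'],      # placeholder for the original column
    'extracted': [None, None],                 # placeholder for the tuple column
    'datetime': ['2021-04-17 19:13:09', '2021-04-17 19:12:34'],
    'comment': [' - Unknown_2\nTransferring to other resource\n',
                ' - Unknown_2\nIssue assigned'],
}, index=[1, 1])

# Drop the helper columns and replace the repeated index left by explode.
result = df.drop(columns=['comments', 'extracted']).reset_index(drop=True)

# Remove hyphens, spaces, and newlines from both ends of each comment.
result['comment'] = result['comment'].str.strip('- \n')

# Parse the timestamp strings into datetime64 values.
result['datetime'] = pd.to_datetime(result['datetime'], format='%Y-%m-%d %H:%M:%S')
```

Note that `str.strip('- \n')` removes those characters only from the two ends, so hyphens inside a comment are kept.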
