Find all instances in a text where the last word of one match should also be the start of the next search, using regex in Python

Posted 2024-05-16 00:04:14


I can't find a solution to a regex problem. This is actually a follow-up to this post: Find string between two substrings AND between string and the end of file

I created the following sample text (in my application the text is much longer, there are multiple files, etc.):

Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1.

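Now I want to parse specific information out of this text. I am interested in the "Record" entries, that is, the text that follows Record, and in the date each record belongs to. By date I mean the date itself, e.g. 22/09/2010, together with the shift (early, evening, or night duty), so a date should look like '22/09/2010 1. Early duty'. My problem is that the file has no real consistency: sometimes a date has two notes and sometimes only one, and sometimes the note part contains text and sometimes it is empty.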

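I know how to parse the Record part, but I don't know how to parse the date first and the notes afterwards. So I want to split the problem in two. Step one: split the whole file into separate date sections. Step two: loop over all date sections and extract the notes of each section (with a regex). I would then build a list containing each date (if I only want the date itself, for example to put it in a column cell, I can simply take the first 13 characters of the date section) and the notes that belong to it. For example: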

list = [02-08-2010 1. Early duty, [note1, note2], 02-08-2010 2. Evening duty, [note1], etc.]

Let's focus on the date parsing so that my problem is clear. I use the following code:

import re

# `text` holds the sample text shown above
date = r'Course\s+(.*?)(?:Course|$)'
date_list = re.findall(date, text, re.DOTALL)
for i in date_list:
    print(i)
    print('XXX')

The output is:

22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3.
XXX
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010
XXX
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
XXX

This output is missing the following elements:

['Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less']

and

['3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions']

I think the regex does not treat the word "Course" as both the end of one match and the start of the next one, so to speak.

It would be great if someone could help me :) Maybe I am missing something.


1 answer
User
#1 · posted 2024-05-16 00:04:14

Change the non-capturing group into a positive lookahead:

r'Course\s+(.*?)(?=Course|$)'
                 ^^

See the regex demo. A faster, unrolled variant is r'Course\s+([^C]*(?:C(?!ourse)[^C]*)*)' (see demo).
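As a quick check that the unrolled variant captures the same sections as the lazy lookahead pattern, the two can be compared side by side. This is only a sketch: the `text` value is a made-up input that contains capital "C" words other than "Course", and the names `lazy` and `unrolled` are chosen for illustration.

```python
import re

# Lazy dot with a lookahead: needs re.DOTALL so "." also matches newlines.
lazy = re.compile(r"Course\s+(.*?)(?=Course|$)", re.DOTALL)

# Unrolled variant: [^C] already matches newlines, so no flag is needed.
unrolled = re.compile(r"Course\s+([^C]*(?:C(?!ourse)[^C]*)*)")

# Hypothetical input, not taken from the question.
text = "Course 1 Confusion: Observing. Course 2 Care plan"

print(lazy.findall(text))      # ['1 Confusion: Observing. ', '2 Care plan']
print(unrolled.findall(text))  # ['1 Confusion: Observing. ', '2 Care plan']
```

The unrolled form avoids testing the lookahead at every character: it only stops to check at each "C", which is why it tends to be faster on long inputs.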

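With the original non-capturing group, the trailing "Course" is consumed as part of the match, so the next search starts after it and overlapping substrings are not matched.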
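The difference is easier to see on a tiny input. The sketch below uses a made-up three-section string (an assumption, not data from the question) and runs both patterns over it:

```python
import re

# Hypothetical shortened input, used only to make the difference visible.
text = "Course A Course B Course C"

# (?:Course|$) consumes the next "Course", so that section cannot start a match.
consuming = re.findall(r"Course\s+(.*?)(?:Course|$)", text, re.DOTALL)

# (?=Course|$) only peeks ahead, so the next search begins at that "Course".
lookahead = re.findall(r"Course\s+(.*?)(?=Course|$)", text, re.DOTALL)

print(consuming)  # ['A ', 'C']
print(lookahead)  # ['A ', 'B ', 'C']
```

The consuming version skips every second section, which is the same effect as the missing elements reported in the question.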

Python demo

import re
rx = r"Course\s+(.*?)(?=Course|$)"
s = "Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1."
results = re.findall(rx, s, re.DOTALL)
for x in results:
    print(x)

Output:

22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. 
22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less 
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 
3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record 
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
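Once the text is split into sections, the second step described in the question (collecting the Record notes per date) could look roughly like the sketch below. It rests on several assumptions that are not confirmed by the question: the shift names are limited to the two that appear in the sample, a note is taken to end at a dotted section number such as "6.2.1.3", at the next "Record", or at the end of the section, and sections that do not start with a date are skipped. The names `SHIFTS`, `notes_by_date` and `sample` are hypothetical, and `sample` is a shortened stand-in for the real text.

```python
import re

# Assumption: only the shift names seen in the sample; extend as needed.
SHIFTS = ("Early duty", "Nightduty")

SECTION_RX = re.compile(r"Course\s+(.*?)(?=Course|$)", re.DOTALL)
LABEL_RX = re.compile(
    r"\d{2}/\d{2}/\d{4}\s+\d+\.\s+(?:%s)" % "|".join(re.escape(n) for n in SHIFTS)
)
# Assumption: a note ends at a dotted number like "6.2.1.3",
# at the next "Record", or at the end of the section.
NOTE_RX = re.compile(
    r"\bRecord\b\s*(.*?)(?=\s*(?:\d+(?:\.\d+)+\s|\bRecord\b|$))", re.DOTALL
)


def notes_by_date(text):
    """Map each 'date + shift' label to the non-empty notes found under it."""
    grouped = {}
    for section in SECTION_RX.findall(text):
        section = section.strip()
        label = LABEL_RX.match(section)
        if label is None:
            continue  # section does not start with a date; skipped in this sketch
        notes = [n.strip() for n in NOTE_RX.findall(section) if n.strip()]
        grouped.setdefault(label.group(0), []).extend(notes)
    return grouped


# Shortened, made-up stand-in for the sample text.
sample = (
    "Course 22/09/2010 1. Early duty Date 22/09/2010 Duty 1. Early duty "
    "Record This is note 3. 6.2.1.3 Confusion: Observing. Record "
    "Course 22/09/2010 1. Early duty Record This is note 2. "
    "Course 22/09/2010 3. Nightduty Record This is note 1."
)
for label, notes in notes_by_date(sample).items():
    print(label, notes)
```

A dict keyed by the date label is used here instead of the flat list from the question, so that notes from several sections with the same date and shift end up together; converting it to a list of pairs is straightforward if that shape is preferred.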
