How to capture all sentences in a file formatted as (name): (sentence)\n(name):

2 Answers

User

Answer 1 · edited 2024-04-25 20:00:57

You never gave us any sample data, so I tested with the following:

name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.

We can try matching with the following pattern:

^\S+:\s+((?:(?!^\S+:).)+)

This can be explained as:

^\S+:\s+           match the name, followed by a colon, followed by one or more whitespace characters
((?:(?!^\S+:).)+)  then match and capture everything up until the next name

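Note that this also handles the edge case of the last sentence: no further name follows it, so the negative lookahead never blocks the match and all of the remaining content is captured.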
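To make the breakdown above easier to follow, here is the same pattern written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the name PATTERN and the short test string are made up for illustration.

```python
import re

# Same pattern as above, spread out with re.VERBOSE (PATTERN is an illustrative name).
PATTERN = re.compile(
    r"""
    ^\S+:\s+            # the name at the start of a line, a colon, then whitespace
    (                   # capture group 1: the sentence
      (?:
        (?!^\S+:)       # stop as soon as a new name begins at a line start
        .               # otherwise consume one character (newlines too, via DOTALL)
      )+
    )
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

# Made-up input: the first sentence wraps onto a second line.
text = "a: first\nsecond line\nb: last"
sentences = PATTERN.findall(text)
print(sentences)  # ['first\nsecond line\n', 'last']
```

The last sentence is captured up to the end of the string because no later line start is followed by a name and a colon.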

Code example:

import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)

['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']
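The captured sentences keep their trailing newline, and the names are discarded. If you want both cleaned up, one possible variation (a sketch, not part of the pattern above) is to add a capture group around the name and strip each sentence afterwards. The variable names pairs and cleaned are only illustrative.

```python
import re

text = (
    "name1: Here is a sentence.\n"
    "name2: Here is another stuff: sentence\n"
    "which happens to have two lines\n"
    "name3: Blah."
)

# Group 1 is now the name, group 2 the sentence.
pairs = re.findall(
    r'^(\S+):\s+((?:(?!^\S+:).)+)', text, flags=re.DOTALL | re.MULTILINE
)
# Drop the trailing newline that the capture leaves on all but the last sentence.
cleaned = [(name, sentence.strip()) for name, sentence in pairs]
print(cleaned)
```

Newlines inside a multi-line sentence are kept; only the leading and trailing whitespace is removed.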


User

Answer 2 · edited 2024-04-25 20:00:57

You can use a lookahead that looks for the same name pattern at the start of a line, followed by a colon:

s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)

This outputs:

[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
 ('CALLER', ''),
 ('CRO', "You're welcome. Thank you.\n"),
 ('OPERATOR', 'Bye.\n'),
 ('CRO', 'Bye.\n'),
 ('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
 ('OPERATOR NEWELL', 'blah blah.\n'),
 ('GUY IN DESK', 'I speak words!')]
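If the goal is to collect what each speaker said, the (name, sentence) tuples can be post-processed. The following is a rough sketch built on the same pattern; group_by_speaker is a hypothetical helper, and joining wrapped lines with spaces and skipping empty entries are assumptions about what you want.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of the answer above.
def group_by_speaker(text):
    """Map each speaker name to the list of their non-empty utterances."""
    pattern = re.compile(
        r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', re.MULTILINE | re.DOTALL
    )
    grouped = defaultdict(list)
    for name, sentence in pattern.findall(text):
        # Join wrapped lines with single spaces and drop surrounding whitespace.
        sentence = " ".join(sentence.split())
        if sentence:  # skip speakers with no text, such as an empty "CALLER:" line
            grouped[name].append(sentence)
    return dict(grouped)

# Made-up sample in the same shape as the transcript above.
sample = "CRO: Hello there.\nCALLER:\nCRO: Bye\nfor now.\nOPERATOR: Bye."
result = group_by_speaker(sample)
print(result)  # {'CRO': ['Hello there.', 'Bye for now.'], 'OPERATOR': ['Bye.']}
```

For a real file, read its contents into a string first and pass that string to the function.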
