Regex to search text across multiple lines

Posted 2024-04-29 11:06:48


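I am trying to use a regex to extract a specific block of text between two known phrases, which will recur in other documents, and to discard everything else. The extracted sentences will then be passed to other functions.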

My problem seems to be that the regex works when the phrases I am searching for are on the same line. When they are on different lines, I get:

print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'

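I expect future reports to have line breaks at different points, depending on what was written earlier. Is there a way to prepare the text by removing all line breaks first, or to make my regex ignore line breaks when searching?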

Any help would be great, thanks.

import fitz
import re

doc = fitz.open(r'file.pdf')
text_list = [ ]
for page in doc:
    text_list.append(page.getText())
    #print(text_list[-1])
text_string = ' '.join(text_list)
test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition" #works for this test
pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"

match = re.search(pat, text_string)
print(match.group(1).strip())

When I search a long text file and the phrases are on the same line, it works. As soon as they are on different lines, it stops working.

Here is a sample of input text that causes the problem:

Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency 
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an 
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance 
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a 
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly 
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill 
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands 

1 answer
User
Reply 1 · Posted 2024-04-29 11:06:48

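Note that . matches any character except a newline, so you can use (.|\n) to match everything, newlines included. A line break may also fall inside one of your fixed phrases, so match the gaps between their words with \s+ rather than a literal space. First define the prefix and suffix of the pattern: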

prefix=r"Observations\s+of\s+Client\s+Behavior:"
suffix=r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"

Then build the pattern and find all occurrences:

pattern=prefix+r"((?:.|\n)*?)"+suffix
f=re.findall(pattern,text_string)

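Putting *? at the end of r"((?:.|\n)*?)" makes the match lazy, so it captures as few characters as possible and stops at the nearest suffix.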
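As a rough sketch of an equivalent approach: the re.DOTALL flag makes . match newlines as well, so the (?:.|\n) alternation is not strictly needed. The sample text and the names sample, lazy and greedy below are made up for illustration. The second call drops the ? to show what the lazy quantifier prevents.

```python
import re

prefix = r"Observations\s+of\s+Client\s+Behavior:"
suffix = r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"

# Hypothetical sample: two blocks, with line breaks in different places.
sample = (
    "Observations of Client Behavior: first\nblock Observations of Client's\n"
    "response to skill acquisition: filler\n"
    "Observations of Client Behavior: second block Observations of Client's "
    "response to skill acquisition: end"
)

# re.DOTALL lets "." match "\n"; the lazy "*?" stops at the nearest suffix.
lazy = re.findall(prefix + r"(.*?)" + suffix, sample, flags=re.DOTALL)

# The greedy "*" runs on to the last suffix, merging both blocks into one match.
greedy = re.findall(prefix + r"(.*)" + suffix, sample, flags=re.DOTALL)

print(lazy)         # one captured string per block
print(len(greedy))  # a single merged match
```

With the lazy version each block comes back as its own item; the greedy version returns one long string that swallows everything between the first prefix and the last suffix.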

Example with several matches spread over multiple lines:

text_string = '''any thing Observations of Client Behavior: patern1 Observations of Client's 
response to skill acquisition: any thing
any thing Observations of Client Behavior: patern2 Observations of 
Client's response to skill acquisition: any thing Observations of Client
Behavior: patern3 Observations of Client's response to skill acquisition: any thing any thing'''

result=re.findall(pattern,text_string)

result=[' patern1 ', ' patern2 ', ' patern3 ']

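The question also asks whether the line breaks can simply be removed first. A minimal sketch of that route, assuming it is acceptable to collapse all whitespace in the extracted PDF text: the helper name extract_between and the shortened sample text are hypothetical. It also checks for None before calling .group(), which avoids the AttributeError from the question when a document lacks either phrase.

```python
import re

def extract_between(text, start, end):
    """Return the text between start and end, or None if no match is found."""
    # Collapse every run of whitespace (newlines included) into a single space.
    flat = re.sub(r"\s+", " ", text)
    # re.escape treats the phrases as literal text rather than regex syntax.
    match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), flat)
    # re.search returns None when nothing matches, so check before .group().
    return match.group(1).strip() if match else None

START = "Observations of Client Behavior:"
END = "Observations of Client's response to skill acquisition:"

# Hypothetical, shortened input with a line break inside the closing phrase.
raw = (
    "Observations of Client Behavior: Aggression frequency\n"
    "has been low and stable. Observations of Client's response to skill\n"
    "acquisition: Overall skill acquisition data trends"
)

found = extract_between(raw, START, END)
missing = extract_between("no markers here", START, END)
print(found)
print(missing)
```

The trade-off is that the original line structure is lost, which is fine here because the extracted sentences are only passed on to other functions.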
