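I am trying to use a regex to extract a specific block of text between two known phrases, which will recur in other documents, and discard everything else. The extracted sentences will then be passed to other functions.
My problem seems to be that the regex works when the phrases I am searching for are on the same line. If they are on different lines, I get: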
print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'
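I expect future reports to have line breaks at different points, depending on what was written before. Is there a way to prepare the text by removing all line breaks first, or to make my regex ignore them when searching?
Any help would be great, thanks.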
import fitz
import re
doc = fitz.open(r'file.pdf')
text_list = [ ]
for page in doc:
    text_list.append(page.getText())
    #print(text_list[-1])
text_string = ' '.join(text_list)
test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition" #works for this test
pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"
match = re.search(pat, text_string)
print(match.group(1).strip())
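When I search a long text file where the phrases are on the same line, it works. But as soon as they are on different lines, it stops working.
Here is a sample of the input text that gives me trouble: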
Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands
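Note that . matches any character except a newline, so you can use (.|\n) to capture everything, line breaks included. A line may also break inside one of your fixed phrases, so the phrases themselves need to tolerate line breaks. First define the prefix and suffix of the pattern, then build the full pattern and find all occurrences.
By using *? at the end of r"((?:.|\n)*?)", we match as few characters as possible, so each match stops at the first suffix that follows it. This also works when the text spans multiple lines and contains multiple matches.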
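The code from the original answer did not survive, so here is a minimal sketch of the approach it describes, using only Python's standard re module. The helper name phrase_to_regex and the shortened sample text are illustrative assumptions, not part of the original answer. The helper joins the words of each fixed phrase with \s+, so a phrase still matches when a line break falls in the middle of it.

```python
import re

# Shortened sample based on the report excerpt in the question; note that the
# closing phrase is itself split across two lines ("skill\nacquisition").
text = (
    "Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency\n"
    "has been low and stable at 0 occurrences for the past two consecutive sessions. Observations of Client's response to skill\n"
    "acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands\n"
)


def phrase_to_regex(phrase):
    """Escape a fixed phrase and allow any run of whitespace between its words.

    Hypothetical helper introduced for this sketch.
    """
    return r"\s+".join(re.escape(word) for word in phrase.split())


prefix = phrase_to_regex("Observations of Client Behavior:")
suffix = phrase_to_regex("Observations of Client's response to skill acquisition")

# (?:.|\n)*? lazily matches any characters, newlines included, so each match
# ends at the first suffix that follows its prefix.
pattern = re.compile(prefix + r"\s*((?:.|\n)*?)\s*" + suffix)

# Collapse the line breaks inside each captured block into single spaces.
blocks = [" ".join(found.split()) for found in pattern.findall(text)]

for block in blocks:
    print(block)
```

Because findall returns an empty list when nothing matches, this avoids the AttributeError raised by calling group on None. Passing re.DOTALL to re.compile and writing (.*?) in place of ((?:.|\n)*?) should behave the same way, since that flag lets . match newlines as well.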