How can I make sure a capture group is longer than 5 characters?

Posted 2024-04-20 06:05:44


The regex I am using is:

(?i)(?<!see )(?<!\d)(?<!")(?<!“)ITEM.*?1A.*?\n*(?<!")(?<!“)RISK.*?FACTORS(?<!")\n*([\s\S]*?)\n*ITEM.*?1B

It captures the text between ITEM 1A. RISK FACTORS and ITEM 1B., but how can I make it capture only when the captured group is longer than 5 characters?

Full string:

ITEM 1A.    RISK FACTORS

123

ITEM 1B.

ITEM 1A.    RISK FACTORS

In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or 

ITEM 1B.

So the desired capture group is:

In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or 

and not:

123

2 Answers

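Do the counting around the data itself, like this: group 1 must start and end with a non-whitespace character and have at least 3 characters in between, so it is at least 5 characters long.
If needed, the regex can be shortened significantly by replacing [^\S\r\n] with \h, in engines that support it (PCRE does; Python's built-in re module does not).
Group 1 contains the trimmed data.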

(?sm)^[^\S\r\n]*ITEM[^\S\r\n]+1A[^\S\r\n]*\.[^\S\r\n]+RISK[^\S\r\n]+FACTORS[^\S\r\n]*\r?\n\s*(\S(?:(?!^[^\S\r\n]*ITEM).){3,}?\S)\s*^[^\S\r\n]*ITEM[^\S\r\n]+1B[^\S\r\n]*\.

https://regex101.com/r/ChQseo/1

Expanded

 (?sm)
 ^ [^\S\r\n]* ITEM [^\S\r\n]+ 1A [^\S\r\n]* \. 
 [^\S\r\n]+ RISK [^\S\r\n]+ FACTORS [^\S\r\n]* \r? \n 

 \s* 
 (                             # (1 start)
      \S 
      (?:
           (?! ^ [^\S\r\n]* ITEM )
           . 
      ){3,}?
      \S 
 )                             # (1 end)
 \s* 

 ^ [^\S\r\n]* ITEM [^\S\r\n]+ 1B [^\S\r\n]* \.
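Since the question is tagged for Python, here is a minimal sketch of how the pattern above could be run with the built-in re module. The pattern is copied unchanged from this answer; the shortened sample text and the names pattern, text and matches are only for illustration. Note that \S + {3,} + \S accepts exactly 5 characters as well; if "more than 5" is meant strictly, changing {3,} to {4,} should do it.

```python
import re

# Pattern copied from the answer above; (?sm) enables DOTALL and MULTILINE inline.
pattern = re.compile(
    r"(?sm)^[^\S\r\n]*ITEM[^\S\r\n]+1A[^\S\r\n]*\.[^\S\r\n]+RISK[^\S\r\n]+FACTORS"
    r"[^\S\r\n]*\r?\n\s*(\S(?:(?!^[^\S\r\n]*ITEM).){3,}?\S)\s*"
    r"^[^\S\r\n]*ITEM[^\S\r\n]+1B[^\S\r\n]*\."
)

# Shortened version of the sample from the question (illustrative only).
text = (
    "ITEM 1A.    RISK FACTORS\n\n123\n\nITEM 1B.\n\n"
    "ITEM 1A.    RISK FACTORS\n\n"
    "In addition to other information in this Form 10-K \n\n"
    "ITEM 1B.\n"
)

# With a single capture group, findall returns the group-1 strings.
matches = pattern.findall(text)
print(matches)
# ['In addition to other information in this Form 10-K']
```

The "123" section is skipped because the tempered dot (?!^[^\S\r\n]*ITEM). refuses to run into the next ITEM heading, so group 1 cannot reach the required length there.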

My guess is that maybe

(?i)(?<!see )(?<!\d)(?<!")(?<!“)ITEM.*?1A.*?\n*(?<!")(?<!“)RISK.*?FACTORS(?<!")\n*([^\r\n]{5,}?)\s*\n*ITEM.*?1B

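might be somewhat close to what you have in mind, though I'm not sure. Note that [^\r\n]{5,}? only captures text that sits on a single line.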

It might also work with the re.DOTALL flag:

import re

regex = r'(?i)(?<!see )(?<!\d)(?<!")(?<!“)ITEM.*?1A.*?\n*(?<!")(?<!“)RISK.*?FACTORS(?<!")\n*([^\r\n]{5,}?)\s*\n*ITEM.*?1B'
string = '''

ITEM 1A.    RISK FACTORS

123

ITEM 1B.

ITEM 1A.    RISK FACTORS

In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or 

ITEM 1B.

'''

print(re.findall(regex, string, re.DOTALL))

Output

['In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or']


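If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also see how it matches against some sample inputs at this link.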
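A different option, not part of either regex above, is to keep the pattern loose and enforce the length in Python after matching. The sketch below assumes a deliberately simplified pattern (it drops the lookbehinds from the question), and SECTION, long_sections and min_chars are hypothetical names chosen for this example.

```python
import re

# Simplified, hypothetical pattern (NOT the asker's full regex): capture
# everything between an "ITEM 1A ... RISK FACTORS" heading and the next "ITEM 1B".
SECTION = re.compile(r"(?is)ITEM\s+1A\.?\s+RISK\s+FACTORS(.*?)ITEM\s+1B")


def long_sections(text, min_chars=6):
    """Return stripped captures that contain at least `min_chars` characters."""
    bodies = (m.group(1).strip() for m in SECTION.finditer(text))
    return [body for body in bodies if len(body) >= min_chars]


sample = (
    "ITEM 1A.    RISK FACTORS\n\n123\n\nITEM 1B.\n\n"
    "ITEM 1A.    RISK FACTORS\n\n"
    "In addition to other information in this Form 10-K \n\n"
    "ITEM 1B.\n"
)

print(long_sections(sample))
# ['In addition to other information in this Form 10-K']
```

The default min_chars=6 means "more than 5 characters" after trimming; doing the check outside the regex also allows multi-line sections, which the [^\r\n]{5,}? version cannot capture.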

