Complex regex returns fewer matches than expected

Posted 2024-05-14 03:39:55

I'm experimenting with regular expressions in Python 2.7 to capture numbered footnotes in a text. The text, converted from a PDF, looks like this:

test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim 
Participation, in which it decided that the victims “may, through their legal 

1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

 8/117 
representatives, participate in the present appeal proceedings for the purpose of 
presenting their views and concerns in respect of their personal interests in the issues 
on appeal”.3

8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the 
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The 
Prosecutor filed a confidential redacted version of the Document in Support of the 
Appeal on 22 March 2013, and a public redacted version of the Document in Support 
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of 
the Appeal, the Prosecutor’s entire third ground of appeal was redacted. 

"""

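Note that numbered paragraphs are a regular part of my text; they are prefixed with a number and a dot (e.g. "5."). Ideally, I would like to get something like this: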

[(1,"The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. "), (2, "A more detailed procedural history is set out in Annex 2 of this judgment.")]

My Python code for extracting the footnotes is:

regex = ur"""
(\r?\n)(?P<num>\d+)(?!\.) #first line
(?P<text>(?:\s(.|\r?\n)+?\s?(?:\n\n|\Z))) #following lines
"""
result = re.findall(regex, test_str, re.U|re.VERBOSE | re.X |re.MULTILINE)

This gives me:

[(u'\n', u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n\n', u'.')]

That is, only the first footnote, whereas I of course need both.

Any ideas are welcome!

2 Answers

You can use this regex to split the data into two parts: the first is the footnote number, the second is the paragraph text.

(?s)(\d+)\n +(.*?)\s*(?=\d+\n)

Explanation:

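  • (?s) --> lets the dot match newlines, which we need here
  • (\d+) --> matches one or more digits and puts them in group 1
  • \n + --> matches a newline; the " +" consumes the leading spaces that we don't want in the second capture group
  • (.*?) --> captures the expected text, as lazily as possible, into group 2
  • \s* --> consumes any trailing whitespace that shouldn't end up in the captured text
  • (?=\d+\n) --> a lookahead that marks where the captured text stops: just before digits followed by a newline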
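
The breakdown above can also be written directly into the pattern. The following is a minimal sketch, assuming Python 3 and a made-up sample string (not the original document); the name FOOTNOTE is only illustrative. Because re.VERBOSE ignores bare whitespace inside the pattern, the literal space is written as [ ].

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part carries a comment.
# In verbose mode a bare space is ignored, so the literal space becomes [ ].
FOOTNOTE = re.compile(r"""
    (\d+)        # group 1: the footnote number
    \n[ ]+       # a newline, then the leading space(s) before the footnote text
    (.*?)        # group 2: the footnote text, matched as lazily as possible
    \s*          # trailing whitespace, kept out of group 2
    (?=\d+\n)    # stop just before digits that are followed by a newline
""", re.DOTALL | re.VERBOSE)

# Hypothetical sample, much shorter than the real text.
sample = "intro text\n1\n First note. \n2\n Second note. \n3\n"

print(FOOTNOTE.findall(sample))
```

Since the lookahead only needs digits followed by a newline, the final "3" in the sample acts as the stop marker for footnote 2 but is not itself returned as a footnote.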

Here is a modified version of your code:

import re

test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim 
Participation, in which it decided that the victims “may, through their legal 

1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

 8/117 
representatives, participate in the present appeal proceedings for the purpose of 
presenting their views and concerns in respect of their personal interests in the issues 
on appeal”.
3

8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the 
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The 
Prosecutor filed a confidential redacted version of the Document in Support of the 
Appeal on 22 March 2013, and a public redacted version of the Document in Support 
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of 
the Appeal, the Prosecutor’s entire third ground of appeal was redacted. 

"""

result = re.findall(r'(?s)(\d+)\n +(.*?)\s*(?=\d+\n)', test_str)

print(result)

It gives the following output, as you expected:

[('1', 'The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1.'), ('2', 'A more detailed procedural history is set out in Annex 2 of this judgment. \nICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A\n\n 8/117 \nrepresentatives, participate in the present appeal proceedings for the purpose of \npresenting their views and concerns in respect of their personal interests in the issues \non appeal”.')]
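
The question asks for integer footnote numbers paired with the text. As a rough sketch of how the findall result could be reshaped that way (Python 3 assumed; the helper name footnotes and the sample string are made up for illustration), the number can be converted with int() and the line breaks inside the text collapsed:

```python
import re

# The pattern from this answer, unchanged.
PATTERN = r'(?s)(\d+)\n +(.*?)\s*(?=\d+\n)'

def footnotes(text):
    """Return (int number, single-line text) tuples; hypothetical helper."""
    return [
        (int(num), " ".join(body.split()))  # split()/join collapses all whitespace runs
        for num, body in re.findall(PATTERN, text)
    ]

# Hypothetical sample, not the original document.
sample = "body text\n1\n First note, wrapped \nover two lines. \n2\n Second note. \n3\n"

print(footnotes(sample))
```

This only tidies whitespace. In the output above, the second footnote still carries the page footer and the continued body text, and this helper does not try to remove them.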

I believe this regex works as you describe: (^\d+(?!\.).*?)(?=^\s*\d)

Python demo:

>>> import re
>>> print ''.join(re.findall(r'(^\d+(?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S))
1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

If you want to capture the footnote number separately from the text:

>>> re.findall(r'^(\d+)((?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S)
[(u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n'), (u'2', u'\n A more detailed procedural history is set out in Annex 2 of this judgment. \nICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A\n')]
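
For readers on Python 3, here is a small sketch of the same idea using named groups and re.finditer. It assumes, as in the question, that a footnote number starts a line and is not followed by a dot; the sample string and the names FOOTNOTE and notes are made up for illustration.

```python
import re

# ^\d+(?!\.) : digits at the start of a line, not followed by a dot
# .*?        : the footnote text, lazily, across lines (re.S)
# (?=^\s*\d) : stop at a line start where, after optional whitespace, a digit follows
FOOTNOTE = re.compile(r'^(?P<num>\d+)(?!\.)(?P<text>.*?)(?=^\s*\d)', re.M | re.S)

# Hypothetical sample that mimics the layout in the question.
sample = (
    "7. A numbered paragraph.\n"
    "\n"
    "1\n"
    " First note. \n"
    "2\n"
    " Second note. \n"
    "\n"
    " 8/117 \n"
    "8. Next paragraph.\n"
)

# strip() drops the surrounding newlines and spaces from each captured text.
notes = [(int(m.group('num')), m.group('text').strip()) for m in FOOTNOTE.finditer(sample)]
print(notes)
```

Because the pattern ends with a lookahead for a following line that begins with a digit, a footnote at the very end of the text, with nothing numeric after it, would not be matched by this sketch.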