Regex to find all complete sentences in a string

Posted 2024-04-28 12:59:06


I have already looked at this thread: Regex to find all sentences of text?, but I can't get it to work for my exact scenario. Here is the text I am working with:


import regex as re

sentence = re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)

phrase = """For necessary expenses of the Office of Inspector 
General, including employment pursuant to the Inspector 
General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), 
$99,912,000, including such sums as may be necessary for 
contracting and other arrangements with public agencies 
and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 
U.S.C. App.), and including not to exceed $125,000 for 
certain confidential operational expenses, including the 
payment of informants, to be expended under the direction 
of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and 
section 1337 of the Agriculture and Food Act of 1981. For necessary expenses of the Office of the General 
23 Counsel, $45,390,000."""

phrase = phrase.replace("\n", "")

sentence.findall(phrase)

# outputs:
['For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
 'App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
 'App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
 'App.) and section 1337 of the Agriculture and Food Act of 1981. ']
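The early cut-offs in this output seem to come from the lazy `.*?`, which stops at the first terminator followed by a space, and in this text that is the period closing "U.S.C.". A minimal sketch of the effect, using the standard `re` module and a made-up sample string (both are assumptions, not part of the original post):

```python
import re

# Same idea as the pattern above: capital letter, lazy match, then a
# terminator followed by a space.
naive = re.compile(r"[A-Z].*?[.!?] ", re.DOTALL)

# Hypothetical sample containing a dotted abbreviation.
sample = "Funds under 5 U.S.C. App. are limited. Done. "

matches = naive.findall(sample)
for m in matches:
    print(repr(m))
```

The first match ends right after "U.S.C. " rather than at the real end of the sentence, which mirrors the behaviour shown above.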

In this case, the long phrase contains only two actual sentences. The first is:

For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and section 1337 of the Agriculture and Food Act of 1981.

The second is:

For necessary expenses of the Office of the General 23 Counsel, $45,390,000.

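Is there a way, with a regex or otherwise, to extract what I want? The end goal is to extract all complete sentences and then search within them for certain things (in case that affects the solution).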


2 Answers

Try this:

# Raw string so the backslashes reach the regex engine unchanged.
regex = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
re.split(regex, phrase)
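For readers who want to try the split above in isolation, here is a self-contained sketch. The helper name `split_sentences` and the shortened sample text are assumptions made for illustration; the pattern itself is the one from this answer. Note that it only treats "." and "?" as terminators, so "!" would need to be added for other texts.

```python
import re

# Split on whitespace that follows "." or "?", unless the period belongs to
# a dotted abbreviation such as "U.S.C." or a short title such as "Mr.".
SENTENCE_BOUNDARY = re.compile(
    r"(?<!\w\.\w.)"       # not right after something like "S.C."
    r"(?<![A-Z][a-z]\.)"  # not right after something like "Mr."
    r"(?<=\.|\?)"         # must come right after "." or "?"
    r"\s"
)

def split_sentences(text):
    """Hypothetical helper: return the pieces between sentence boundaries."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [part for part in SENTENCE_BOUNDARY.split(flat) if part]

# Shortened stand-in for the text in the question.
sample = (
    "Funds under the Act of 1978 (Public Law 95-452; 5 U.S.C. App.), "
    "$99,912,000, and section 1337 of the Act of 1981. "
    "For the Office of the General Counsel, $45,390,000."
)

sentences = split_sentences(sample)
for s in sentences:
    print(s)
```

The lookbehinds are heuristics tuned to this kind of text; an abbreviation with a different shape (for example "App." directly followed by a space) would still be treated as a sentence end.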
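import re
# s is assumed to hold the original multi-line text from the question
# (before the newlines were removed).
print([x for x in re.split(r"([A-Z].+(\(.+\)){0,1}.+)\.\s", s.replace("\n", " ")) if x])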

Output:

['For necessary expenses of the Office of Inspector  General, including employment pursuant to the Inspector  General Act of 1978 (Public Law 95–452; 5 U.S.C. App.),  $99,912,000, including such sums as may be necessary for  contracting and other arrangements with public agencies  and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5  U.S.C. App.), and including not to exceed $125,000 for  certain confidential operational expenses, including the  payment of informants, to be expended under the direction  of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and  section 1337 of the Agriculture and Food Act of 1981', 'For necessary expenses of the Office of the General  23 Counsel, $45,390,000.']

The regex is:

regex = r"([A-Z].+(\(.+\)){0,1}.+)\.\s"

re.split(r"([A-Z].+(\(.+\)){0,1}.+)\.\s",s.replace("\n"," "))
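Since the stated end goal is to search inside the extracted sentences, the two steps could be combined roughly as follows. This is only a sketch: `sentences_matching` is a hypothetical helper, the sample text is a shortened stand-in, and the boundary pattern is borrowed from the lookbehind-based split shown in the other answer rather than from the regex above.

```python
import re

# Boundary heuristic: whitespace after "." or "?", skipping dotted
# abbreviations like "U.S.C." and short titles like "Mr.".
BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

def sentences_matching(text, pattern):
    """Hypothetical helper: sentences of `text` in which `pattern` is found."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    needle = re.compile(pattern)
    return [s for s in BOUNDARY.split(flat) if s and needle.search(s)]

# Shortened stand-in for the text in the question.
text = """Funds under the Act of 1978 (Public Law 95-452; 5 U.S.C. App.),
$99,912,000, for section 1337 of the Act of 1981.
For the Office of the General Counsel, $45,390,000."""

hits = sentences_matching(text, r"General Counsel")
print(hits)
```

Splitting first and filtering afterwards keeps each hit tied to its full sentence, which is usually what a follow-up search needs.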
