使用regex从PDF原始文本中提取子字符串

2条回答

网友

1楼 · 编辑于 2024-06-01 02:12:01

经过一番探索，以下是最接近我期望实现的解决方案：

regx = re.compile( ': \ni(?:(?!\n[A-Z]).).*?\.\n\d\.|:\ni(?:(?!\n[A-Z]).).*?\.\n\d\.',re.DOTALL)
find = str(cleanSectionContent2[req])

它检测那些以“：i.”开头并以节头“\n\d.”结尾的情况，但它无法检测所有情况，因此我将在此处更新更多解决方案。你知道吗

网友

2楼 · 编辑于 2024-06-01 02:12:01

如果我正确理解这个问题，我们只需要去掉罗马索引，得到整个段落，我们将从一个简单的表达式开始，例如：

.+[0-9]\.?.+?([A-Z][a-z].*)

当出现新的情况时，我们只需要使用逻辑OR并添加额外的规则。你知道吗

Demo

测试

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r".+[0-9]\.?.+?([A-Z][a-z].*)"

test_str = ("\\n1.1\\n \\nSCOPE\\n \\nThis PTS specifies the\\n \\nrequirements \\nand recommendations for Classification, Verification \\n\\nFunct\\nions.\\n \\nThe scope includes the following:\\n \\ni.\\n \\nSemi\\n-\\nquantitative SIL classification\\n \\nii.\\n \\nSpurious trip analysis\\n \\niii.\\n \\nProbabilistic and architectural SIL verification\\n \\niv.\\n \\nRecommendations\\n \\nfor SIL gap closure'\n\n"
    "3.1.3\\n \\nDo\\nc\\numentation\\n \\nrequired\\n \\nT\\nh\\ne\\n \\nl\\nat\\ne\\ns\\nt\\n \\nissue\\n \\nof\\n \\nt\\nh\\ne\\n \\nf\\no\\nllo\\nw\\ni\\nng\\n \\ndocume\\nn\\nts\\n \\nshall\\n \\nbe\\n \\nav\\na\\nilab\\nl\\ne\\n \\nto\\n \\nthe\\n \\nte\\na\\nm\\n \\np\\ne\\nrf\\no\\nrm\\ni\\nng\\n \\nt\\nh\\ne \\nc\\nl\\nass\\ni\\nf\\ni\\ncati\\no\\nn:\\n \\ni.\\n \\nMandatory reference document\\n \\na)\\n \\nCause and effect matrices (CEM)\\n \\nb)\\n \\nPiping and Instrument Diagram (P&ID) or Process and utility engineering \\nflow schemes (PEFS)\\n \\nc)\\n \\nHAZOP report\\n \\nd)\\n \\nIPF reliability data\\n \\nii.\\n \\nOther reference document\\n \\na)\\n \\nProcess Flow Diagram (PFD) or Process Fl\\now Scheme (PFS)\\n \\nb)\\n \\nPlant layout drawing\\n \\nc)\\n \\nProcess safeguarding flow schemes (PSFS)\\n \\nd)\\n \\nControl narratives\\n \\ne)\\n \\nInterlocks/ ESD logic diagram\\n \\nf)\\n \\nEquipment layout diagram\\n \\ng)\\n \\nMaintenance and Inspection Data\\n \\nh)\\n \\nPlant historian data\\n \\n \\nT\\nh\\ne\\n \\nl\\ni\\ns\\nt\\n \\na\\nb\\no\\nve\\n \\nis\\n \\nn\\no\\nt\\n \\ne\\nx\\nh\\na\\nu\\nsti\\nv\\ne. Any\\n \\not\\nh\\ne\\nr\\n \\ndo\\nc\\nu\\nm\\ne\\nn\\nt\\ns\\n/ \\nd\\nr\\na\\nw\\nin\\ng\\ns\\n \\nreq\\nu\\nir\\ne\\nd\\n \\nf\\no\\nr\\n \\nt\\nhe \\nc\\nom\\np\\nletion\\n \\no\\nf the\\n \\nIPF\\n \\ns\\nt\\nu\\nd\\ny\\n \\ns\\nh\\na\\nll\\n \\nbe\\n \\nf\\nu\\nr\\nn\\nished\\n \\nas\\n \\na\\nn\\nd\\n \\nw\\nhen\\n \\nre\\nq\\nui\\nr\\ne\\nd\\n.\\n \\n")

subst = "\\1"

# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print (result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

正则表达式

如果不需要这个表达式，可以在regex101.com中修改/更改它。你知道吗

正则表达式电路

jex.im可视化正则表达式：

Demo

测试

正则表达式

正则表达式电路

相关问题更多 >

编程相关推荐

热门问题

热门文章

使用regex从PDF原始文本中提取子字符串

Demo

测试

正则表达式

正则表达式电路

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >