Using regex to extract a substring from raw PDF text

Posted 2024-06-01 02:12:01


I am trying to extract subsections that have Roman-numeral indices from a PDF document.

For example, this is part of the document:

\n1.1\n \nSCOPE\n \nThis PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure'

What I want is just:

This PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure

I need the sentence that precedes the Roman-numeral index as well as the content of the indexed items.

However, there are also cases like the following:

3.1.3\n \nDo\nc\numentation\n \nrequired\n \nT\nh\ne\n \nl\nat\ne\ns\nt\n \nissue\n \nof\n \nt\nh\ne\n \nf\no\nllo\nw\ni\nng\n \ndocume\nn\nts\n \nshall\n \nbe\n \nav\na\nilab\nl\ne\n \nto\n \nthe\n \nte\na\nm\n \np\ne\nrf\no\nrm\ni\nng\n \nt\nh\ne \nc\nl\nass\ni\nf\ni\ncati\no\nn:\n \ni.\n \nMandatory reference document\n \na)\n \nCause and effect matrices (CEM)\n \nb)\n \nPiping and Instrument Diagram (P&ID) or Process and utility engineering \nflow schemes (PEFS)\n \nc)\n \nHAZOP report\n \nd)\n \nIPF reliability data\n \nii.\n \nOther reference document\n \na)\n \nProcess Flow Diagram (PFD) or Process Fl\now Scheme (PFS)\n \nb)\n \nPlant layout drawing\n \nc)\n \nProcess safeguarding flow schemes (PSFS)\n \nd)\n \nControl narratives\n \ne)\n \nInterlocks/ ESD logic diagram\n \nf)\n \nEquipment layout diagram\n \ng)\n \nMaintenance and Inspection Data\n \nh)\n \nPlant historian data\n \n \nT\nh\ne\n \nl\ni\ns\nt\n \na\nb\no\nve\n \nis\n \nn\no\nt\n \ne\nx\nh\na\nu\nsti\nv\ne. Any\n \not\nh\ne\nr\n \ndo\nc\nu\nm\ne\nn\nt\ns\n/ \nd\nr\na\nw\nin\ng\ns\n \nreq\nu\nir\ne\nd\n \nf\no\nr\n \nt\nhe \nc\nom\np\nletion\n \no\nf the\n \nIPF\n \ns\nt\nu\nd\ny\n \ns\nh\na\nll\n \nbe\n \nf\nu\nr\nn\nished\n \nas\n \na\nn\nd\n \nw\nhen\n \nre\nq\nui\nr\ne\nd\n.\n \n

I have converted the PDF to raw text and managed to extract part of the document:

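import re

# Raw string so that the backslashes reach the regex engine unchanged
regx = re.compile(r'\.\n \n.+?:\n \n', re.DOTALL)
find = str(txt)  # txt holds the raw text extracted from the PDF
indexhead.append(regx.findall(find))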

The code above extracts only the heading; it does not also extract the Roman-numeral index:

.\n \nThe scope includes the following:\n \n

I am trying to extract based on a pattern, but I think some conditional rules might help.


2 answers

After some exploration, this is the closest solution to what I want to achieve:

# Raw string; ": ?" merges the two original alternatives (with and without a space after the colon)
regx = re.compile(r': ?\ni(?:(?!\n[A-Z]).).*?\.\n\d\.', re.DOTALL)
find = str(cleanSectionContent2[req])

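It detects cases that start with a colon followed by the Roman index "i." and end at a section header ("\n\d."), but it does not detect every case, so I will update this answer with more solutions.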
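A different approach, as a minimal sketch rather than a definitive fix: instead of matching the whole block with one pattern, split the text on the Roman index lines. This assumes each index sits on its own line (as in the samples above); `ROMAN_MARKER`, `collapse`, and `split_roman_section` are hypothetical names, and the sample string is a shortened excerpt, not the full document.

```python
import re

# Assumption: a Roman index sits on its own line, e.g. "i." or "iv."
ROMAN_MARKER = re.compile(r"^[ \t]*[ivx]+\.[ \t]*$", re.MULTILINE)


def collapse(fragment):
    """Join the broken lines of a fragment with single spaces."""
    return " ".join(fragment.split())


def split_roman_section(text):
    """Return (intro, items): the text before the first Roman index and
    the text that follows each index."""
    parts = ROMAN_MARKER.split(text)
    return collapse(parts[0]), [collapse(p) for p in parts[1:]]


# Shortened, hypothetical excerpt in the same shape as the question's sample
sample = (
    "The scope includes the following:\n \ni.\n \nSpurious trip analysis"
    "\n \nii.\n \nRecommendations\n \nfor SIL gap closure"
)
intro, items = split_roman_section(sample)
print(intro)
print(items)
```

Words that the PDF extraction broke across lines (such as "Funct\nions") would come out with a space in the middle, so they would still need separate handling, as would the section number and heading that precede the intro in the full text.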

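If I understand the problem correctly, we just want to get rid of the Roman index and get the entire paragraph, so we would start with a simple expression such as: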

.+[0-9]\.?.+?([A-Z][a-z].*)

When new cases arise, we simply use a logical OR (`|`) and add further rules.
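One way to keep such OR rules manageable, sketched here under assumptions: hold each rule as a separate string and join them with `|`. The rule list, `MARKER`, and `find_markers` are hypothetical; the two rules shown cover Roman indices ("i.") and lettered sub-items ("a)") on their own lines, as in the second sample of the question.

```python
import re

# Hypothetical rule list: each entry is one alternative in the OR
MARKER_RULES = [
    r"[ivx]+\.",  # Roman indices such as "i." or "iv."
    r"[a-z]\)",   # lettered sub-items such as "a)"
]

# Join the rules with "|" inside a non-capturing group
MARKER = re.compile(
    r"^[ \t]*(?:" + "|".join(MARKER_RULES) + r")[ \t]*$",
    re.MULTILINE,
)


def find_markers(text):
    """Return every index marker that sits on its own line."""
    return [m.group().strip() for m in MARKER.finditer(text)]


# Shortened, hypothetical excerpt shaped like the second sample
sample = (
    "n:\n \ni.\n \nMandatory reference document\n \na)\n \n"
    "Cause and effect matrices (CEM)\n \nb)\n \nHAZOP report\n \n"
    "ii.\n \nOther reference document"
)
print(find_markers(sample))
```

Adding a rule for a new case is then a one-line change to `MARKER_RULES`, which is easier to review than editing one long expression.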

Demo

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r".+[0-9]\.?.+?([A-Z][a-z].*)"

test_str = ("\\n1.1\\n \\nSCOPE\\n \\nThis PTS specifies the\\n \\nrequirements \\nand recommendations for Classification, Verification \\n\\nFunct\\nions.\\n \\nThe scope includes the following:\\n \\ni.\\n \\nSemi\\n-\\nquantitative SIL classification\\n \\nii.\\n \\nSpurious trip analysis\\n \\niii.\\n \\nProbabilistic and architectural SIL verification\\n \\niv.\\n \\nRecommendations\\n \\nfor SIL gap closure'\n\n"
    "3.1.3\\n \\nDo\\nc\\numentation\\n \\nrequired\\n \\nT\\nh\\ne\\n \\nl\\nat\\ne\\ns\\nt\\n \\nissue\\n \\nof\\n \\nt\\nh\\ne\\n \\nf\\no\\nllo\\nw\\ni\\nng\\n \\ndocume\\nn\\nts\\n \\nshall\\n \\nbe\\n \\nav\\na\\nilab\\nl\\ne\\n \\nto\\n \\nthe\\n \\nte\\na\\nm\\n \\np\\ne\\nrf\\no\\nrm\\ni\\nng\\n \\nt\\nh\\ne \\nc\\nl\\nass\\ni\\nf\\ni\\ncati\\no\\nn:\\n \\ni.\\n \\nMandatory reference document\\n \\na)\\n \\nCause and effect matrices (CEM)\\n \\nb)\\n \\nPiping and Instrument Diagram (P&ID) or Process and utility engineering \\nflow schemes (PEFS)\\n \\nc)\\n \\nHAZOP report\\n \\nd)\\n \\nIPF reliability data\\n \\nii.\\n \\nOther reference document\\n \\na)\\n \\nProcess Flow Diagram (PFD) or Process Fl\\now Scheme (PFS)\\n \\nb)\\n \\nPlant layout drawing\\n \\nc)\\n \\nProcess safeguarding flow schemes (PSFS)\\n \\nd)\\n \\nControl narratives\\n \\ne)\\n \\nInterlocks/ ESD logic diagram\\n \\nf)\\n \\nEquipment layout diagram\\n \\ng)\\n \\nMaintenance and Inspection Data\\n \\nh)\\n \\nPlant historian data\\n \\n \\nT\\nh\\ne\\n \\nl\\ni\\ns\\nt\\n \\na\\nb\\no\\nve\\n \\nis\\n \\nn\\no\\nt\\n \\ne\\nx\\nh\\na\\nu\\nsti\\nv\\ne. Any\\n \\not\\nh\\ne\\nr\\n \\ndo\\nc\\nu\\nm\\ne\\nn\\nt\\ns\\n/ \\nd\\nr\\na\\nw\\nin\\ng\\ns\\n \\nreq\\nu\\nir\\ne\\nd\\n \\nf\\no\\nr\\n \\nt\\nhe \\nc\\nom\\np\\nletion\\n \\no\\nf the\\n \\nIPF\\n \\ns\\nt\\nu\\nd\\ny\\n \\ns\\nh\\na\\nll\\n \\nbe\\n \\nf\\nu\\nr\\nn\\nished\\n \\nas\\n \\na\\nn\\nd\\n \\nw\\nhen\\n \\nre\\nq\\nui\\nr\\ne\\nd\\n.\\n \\n")

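subst = "\\1"

# Change count to limit the number of replacements (0 replaces all matches).
# count and flags are passed by keyword; passing them positionally is deprecated in recent Python versions.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)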

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

RegEx

If this expression is not what you want, you can modify it on regex101.com.

RegEx Circuit

jex.im visualizes regular expressions.

