Extracting complex words from a paragraph

Posted 2024-06-09 16:17:31


I am trying to extract a set of text from a paragraph read out of a four-column PDF table.

This is the original text:

  015536159/6630     CAGE Contract Number Quantity Unit Cost AWD Date 32YK1 SPE2DH19P0522 22.000 1394.13000 20190102 32YK1 SPE2DH18P1630 21.000 1356.41000 20180604 74YZ3  SPE2DH18P1184 15.000 1282.50000 20180314 32YK1 SPE2DH17V1630 16.000 1335.91000 20170214 58837 SPE2DH16V2501 17.000 1369.00000 20160601 32YK1 SPE2DH16M0463 13.000 1358.20000 20151125  CONTINUED ON NEXT PAGE       
                                                           CONTINUATION SHEET     REFERENCE NO. OF DOCUMENT BEING CONTINUED: SPE2DH-19-T-6601     PAGE 4 OF 22 PAGES        SECTION A  Procurement History for NSN/FSC:015536159/6630  CAGE Contract Number  Quantity Unit Cost AWD Date              32YK1 S$ DH16M0068 32YK1 SPE2DH14V3122 32YK1 S$ DH14V2252 32YK1 SPE2DH14V0165     58837 SPM2DH13V1222 08576 SPM2DH13M0509 58837 SPM2DH12V0342 08576 SPM2DH12M0490 08576 SPM2DH11V1261 3BSP4 SPM2DSO8MA800 3BSP4 SPM2DS08M6542 3BSP4 SPM2DS08M5128 3BSP4 SPM2DS08M5127 3BSP4 SPM2DS08M5125  18.000 1462.05000 20151005 12.000 1246.39000 20140918 9.000 1246.39000 20140711 10.000 1246.39000 20131223 12.000 1258.00000 20130724 15.000 1100.09000 20121205 27.000 1200.00000 20111223 34.000 1057.77000 20111202 3.000 1057.77000 20110727  2.000 947.16000 20080721 100.000 947.16000 20080323 2.000 947.16000 20080227 2.000 947.16000 20080227 2.000 947.16000 20080225  CONTINUED ON NEXT PAGE       
             CONTINUATION SHEET REFERENCE NO. OF DOCUMENT BEING CONTINUED:  SPE2DH-19-T-6601        PAGE 5 OF 22 PAGES        SECTION B 

I only want to extract this text:

 32YK1 SPE2DH19P0522 22.000 1394.13000 20190102 32YK1 SPE2DH18P1630 21.000 1356.41000 20180604 74YZ3  SPE2DH18P1184 15.000 1282.50000 20180314 32YK1 SPE2DH17V1630 16.000 1335.91000 20170214 58837 SPE2DH16V2501 17.000 1369.00000 20160601 32YK1 SPE2DH16M0463 13.000 1358.20000 20151125
  32YK1 S$ DH16M0068 32YK1 SPE2DH14V3122 32YK1 S$ DH14V2252 32YK1 SPE2DH14V0165     58837 SPM2DH13V1222 08576 SPM2DH13M0509 58837 SPM2DH12V0342 08576 SPM2DH12M0490 08576 SPM2DH11V1261 3BSP4 SPM2DSO8MA800 3BSP4 SPM2DS08M6542 3BSP4 SPM2DS08M5128 3BSP4 SPM2DS08M5127 3BSP4 SPM2DS08M5125  18.000 1462.05000 20151005 12.000 1246.39000 20140918 9.000 1246.39000 20140711 10.000 1246.39000 20131223 12.000 1258.00000 20130724 15.000 1100.09000 20121205 27.000 1200.00000 20111223 34.000 1057.77000 20111202 3.000 1057.77000 20110727  2.000 947.16000 20080721 100.000 947.16000 20080323 2.000 947.16000 20080227 2.000 947.16000 20080227 2.000 947.16000 20080225

I have tried many different approaches, such as building a list of unwanted words and then removing them from the paragraph with this code:

 filterWords= [preNSN,FSC,NSN,'NSN/FSC:'+NSN,'Cage','Contract','Number','Quantity','Unit','Cost','AWD','Date','CONTINUED', 'SECTION', 'Procurement','history','For','on','Next','Page','Continuation','Sheet','Reference','of','Document','Being','CONTINUED','pages','SECTION']


 regex = r'\b(?:' + '|'.join(filterWords) + r')\s*\b'
 filteredHistory = re.sub(regex, '', history, flags=re.IGNORECASE)

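The problem is that sometimes not all of the unwanted words are removed. Is there a way to target only the wanted words instead of deleting the unwanted ones?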


2 answers

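It is not 100% clear what pattern you are searching for, so I am assuming you want the text on each line that starts with '32YK1' and ends with 'CONTINUED ON NEXT PAGE'. This finds the text between those two markers: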

import re

# your_string holds the raw text extracted from the PDF
matches = re.findall(r'32YK1.*CONTINUED ON NEXT PAGE', your_string)
lines = []
for match in matches:
    lines.append(match.replace("CONTINUED ON NEXT PAGE", ""))

If your documents share a similar structure, you can try parsing the text between markers.
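
Parsing between markers can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name `between_markers`, the shortened `sample` text, and the choice of `AWD Date` (the end of the column header) as the opening marker are assumptions made for illustration. Unlike the snippet above, it does not rely on the first CAGE code being `32YK1`.

```python
import re


def between_markers(text, start="AWD Date", end="CONTINUED ON NEXT PAGE"):
    """Return the text found between each start/end marker pair."""
    # Allow any run of whitespace between the words of a marker,
    # since PDF extraction tends to pad columns unevenly.
    start_re = r"\s+".join(re.escape(word) for word in start.split())
    end_re = r"\s+".join(re.escape(word) for word in end.split())
    # Non-greedy capture so each header pairs with the nearest end marker;
    # DOTALL lets a record span a line break.
    pattern = re.compile(start_re + r"\s*(.*?)\s*" + end_re, flags=re.DOTALL)
    # Collapse internal whitespace runs to single spaces.
    return [" ".join(chunk.split()) for chunk in pattern.findall(text)]


# Shortened, hypothetical sample in the same shape as the question's text.
sample = (
    "015536159/6630  CAGE Contract Number Quantity Unit Cost AWD Date "
    "32YK1 SPE2DH19P0522 22.000 1394.13000 20190102  CONTINUED ON NEXT PAGE\n"
    "  CONTINUATION SHEET  PAGE 4 OF 22 PAGES  SECTION A  CAGE Contract Number"
    "  Quantity Unit Cost AWD Date    58837 SPM2DH13V1222 12.000 1258.00000 "
    "20130724  CONTINUED ON NEXT PAGE\n"
    "  CONTINUATION SHEET  PAGE 5 OF 22 PAGES  SECTION B"
)

for record in between_markers(sample):
    print(record)
```

If the header wording differs between documents, the `start` and `end` arguments would need to be adjusted to match.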

You may need to stare at your data for a while to work out its structure.

For the sample provided, this works:

raw = """  015536159/6630     CAGE Contract Number Quantity Unit Cost AWD Date 32YK1 SPE2DH19P0522 22.000 1394.13000 20190102 32YK1 SPE2DH18P1630 21.000 1356.41000 20180604 74YZ3  SPE2DH18P1184 15.000 1282.50000 20180314 32YK1 SPE2DH17V1630 16.000 1335.91000 20170214 58837 SPE2DH16V2501 17.000 1369.00000 20160601 32YK1 SPE2DH16M0463 13.000 1358.20000 20151125  CONTINUED ON NEXT PAGE       
                                                           CONTINUATION SHEET     REFERENCE NO. OF DOCUMENT BEING CONTINUED: SPE2DH-19-T-6601     PAGE 4 OF 22 PAGES        SECTION A  Procurement History for NSN/FSC:015536159/6630  CAGE Contract Number  Quantity Unit Cost AWD Date              32YK1 S$ DH16M0068 32YK1 SPE2DH14V3122 32YK1 S$ DH14V2252 32YK1 SPE2DH14V0165     58837 SPM2DH13V1222 08576 SPM2DH13M0509 58837 SPM2DH12V0342 08576 SPM2DH12M0490 08576 SPM2DH11V1261 3BSP4 SPM2DSO8MA800 3BSP4 SPM2DS08M6542 3BSP4 SPM2DS08M5128 3BSP4 SPM2DS08M5127 3BSP4 SPM2DS08M5125  18.000 1462.05000 20151005 12.000 1246.39000 20140918 9.000 1246.39000 20140711 10.000 1246.39000 20131223 12.000 1258.00000 20130724 15.000 1100.09000 20121205 27.000 1200.00000 20111223 34.000 1057.77000 20111202 3.000 1057.77000 20110727  2.000 947.16000 20080721 100.000 947.16000 20080323 2.000 947.16000 20080227 2.000 947.16000 20080227 2.000 947.16000 20080225  CONTINUED ON NEXT PAGE       
             CONTINUATION SHEET REFERENCE NO. OF DOCUMENT BEING CONTINUED:  SPE2DH-19-T-6601        PAGE 5 OF 22 PAGES        SECTION B """

start = '32YK1'

result = []
for line in raw.splitlines():
    res = []
    seen = False
    for elt in line.split():
        if elt == start:      # <- start recording tokens past (and including) this one.
            seen = True
        if seen:
            res.append(elt)
    result.append(res[:-4])   # <- remove the last 4 unneeded tokens using slicing

for res in result:
    print(' '.join(res))

[Edit]: explaining the code

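Looking at the data shows that the wanted information on each line starts with the token '32YK1'; everything before it is discarded.
Further inspection shows that the last four tokens are not needed, so they are excluded from the final selection.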
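
The token scan described above can also be wrapped in a small function. This is a minimal sketch under assumptions: the name `tokens_after` and its `start` and `drop_last` parameters are hypothetical, and the defaults simply mirror the values used in the answer. Lines that never contain the start token yield an empty list.

```python
def tokens_after(line, start="32YK1", drop_last=4):
    """Keep tokens from the first `start` token on, minus the last `drop_last`."""
    tokens = line.split()
    if start not in tokens:
        # Nothing to keep on lines without the start token.
        return []
    kept = tokens[tokens.index(start):]
    # Compute the end index explicitly so drop_last=0 keeps everything
    # and an oversized drop_last cannot wrap around.
    end = max(0, len(kept) - drop_last)
    return kept[:end]


# Shortened, hypothetical line in the same shape as the question's text.
line = (
    "015536159/6630 CAGE Contract Number Quantity Unit Cost AWD Date "
    "32YK1 SPE2DH19P0522 22.000 1394.13000 20190102 CONTINUED ON NEXT PAGE"
)
print(" ".join(tokens_after(line)))
```

Hard-coding the start token and the trailing count keeps the code short, but both values depend on the exact layout of the extracted text and would need checking against each new document.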
