How can I extract tables from a PDF using PDFMiner?

Posted 2024-04-20 12:13:22


I am trying to extract information from some tables in a PDF document.
Consider the input:

Title 1
some text some text some text some text some text
some text some text some text some text some text

Table Title
| Col1          | Col2    | Col3    |
|---------------|---------|---------|
| val11         | val12   | val13   |
| val21         | val22   | val23   |
| val31         | val32   | val33   |

Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

I can get the outline/titles like this:

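from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, se) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    # Print the pair as a tuple so the output looks the same on Python 2 and 3.
    print((level, title))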

This gives me:

(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')

This is perfect, because the levels line up with the text hierarchy. Now I can extract the text as follows:

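from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object that aggregates each page into a layout tree.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
# The with-block closes the output file even if a page fails to parse.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                             for i in element.get_text()]))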

This gives me:

Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val12
val13
Col2
val21
val22
val23
Col3
val31
val32
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text

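This is a bit odd, because the table is extracted column by column. Can I get the table row by row? Also, how can I determine where the table starts and ends?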
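One possible way to think about the row-by-row part of the question is to ignore the order in which the text boxes arrive and regroup the fragments by their vertical position. The following is only a minimal sketch under stated assumptions: `group_into_rows`, the `(x0, y0, text)` tuples, and the `y_tolerance` value are hypothetical names chosen for illustration, not part of PDFMiner. In PDFMiner the coordinates would have to be read from each layout object's `bbox` attribute (`x0, y0, x1, y1`) first, and real tables with wrapped cells or uneven baselines may need more care.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x0, y0, text) fragments into rows, top to bottom, left to right.

    Fragments whose y0 differs from the first fragment of the current row
    by at most y_tolerance are treated as belonging to that row.
    """
    rows = []  # each entry: (reference_y, [(x0, text), ...])
    # PDF coordinates grow upward, so a larger y0 means higher on the page.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    # Within each row, order the cells from left to right by x0.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Fragments listed column by column, as in the extracted text above.
fragments = [
    (50, 700.0, "Col1"), (50, 680.0, "val11"), (50, 660.0, "val21"),
    (150, 700.5, "Col2"), (150, 680.0, "val12"), (150, 660.0, "val22"),
]
for row in group_into_rows(fragments):
    print(row)
```

This does not answer where a table starts and ends; it only reorders fragments that are already known to belong to one table.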


2 Answers

If you only want to extract tables from a PDF document, take a look at this answer: How to extract table as text from the PDF using Python?

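Following that answer, I tried tabula-py, which worked for me on a table spanning multiple pages of a PDF file. tabula-py correctly skipped all the headers and footers. Previously I had tried PDFMiner on the same type of document and ran into the same problem you describe, and sometimes even worse.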
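For reference, a minimal sketch of what the tabula-py approach might look like. The wrapper name `extract_tables_with_tabula` and the file name are assumptions for illustration; `tabula.read_pdf` with `pages` and `multiple_tables` is the library call. tabula-py needs a Java runtime, and the options that work best depend on the document.

```python
def extract_tables_with_tabula(path, pages="all"):
    """Return the tables tabula-py finds in the PDF as a list of DataFrames."""
    # Imported here so this module can be loaded without tabula-py installed.
    import tabula

    # multiple_tables=True asks for one DataFrame per detected table.
    return tabula.read_pdf(path, pages=pages, multiple_tables=True)


if __name__ == "__main__":
    # 'myFile.pdf' is the placeholder file name used in the question.
    for table in extract_tables_with_tabula("myFile.pdf"):
        print(table)
```

Each DataFrame keeps the cells row by row, which is the layout the question asks for.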

Use camelot to extract tables from PDFs.
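A minimal sketch of that suggestion, assuming the camelot-py package is installed. The wrapper name `extract_tables_with_camelot` and the file name are hypothetical; `camelot.read_pdf` and the `.df` attribute of each returned table are the library's own. Which `flavor` works better ("lattice" for tables with ruling lines, "stream" for whitespace-separated ones) depends on the document.

```python
def extract_tables_with_camelot(path, pages="1", flavor="lattice"):
    """Return each table camelot detects as a pandas DataFrame."""
    # Imported here so this module can be loaded without camelot installed.
    import camelot

    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    # Every detected table exposes its cells as a DataFrame via .df.
    return [table.df for table in tables]


if __name__ == "__main__":
    # 'myFile.pdf' is the placeholder file name used in the question.
    for df in extract_tables_with_camelot("myFile.pdf", pages="all"):
        print(df)
```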
