How to extract semi-structured tables from a PDF with pdfplumber

Asked 2025-04-12 21:55

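I want to extract semi-structured tables from PDF files. If another module works better than pdfplumber, I am open to using it. I need more than the table itself: sometimes the text above a table is part of it (for example, the column names sometimes sit above the table), and a table may continue on the next page.

I tried extract_text_lines() and it works well. My plan is to go through the PDF line by line and, whenever a line belongs to a table, start collecting its data.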

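import pdfplumber

def extract_table_from_page(pdf_path, page_number):
    # page_number is a zero-based index into pdf.pages
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Each line is a dict holding the line text, its position, and its chars
        lines = page.extract_text_lines()
        for line in lines:
            if "chars" in line:
                print(line)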
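One possible way to attach the text above a table to the table itself is sketched below. It assumes the dicts returned by extract_text_lines() carry "text", "top" and "bottom" keys, and that page.find_tables() returns objects with a bbox of the form (x0, top, x1, bottom). The helper names header_lines_above and tables_with_context, and the max_gap threshold of 30 points, are hypothetical choices for illustration, not part of pdfplumber.

```python
def header_lines_above(lines, table_top, max_gap=30.0):
    """Return the text of lines ending above table_top, at most max_gap points away."""
    picked = [
        ln for ln in lines
        if ln["bottom"] <= table_top and table_top - ln["bottom"] <= max_gap
    ]
    # Keep reading order (top to bottom)
    return [ln["text"] for ln in sorted(picked, key=lambda ln: ln["top"])]


def tables_with_context(pdf_path, page_number, max_gap=30.0):
    """Pair every table found on a page with the text lines just above it."""
    import pdfplumber  # imported here so the helper above works without it

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        lines = page.extract_text_lines()
        for table in page.find_tables():
            table_top = table.bbox[1]  # bbox is (x0, top, x1, bottom)
            results.append({
                "context": header_lines_above(lines, table_top, max_gap),
                "rows": table.extract(),
            })
    return results
```

The right max_gap depends on the layout of the documents, so it probably needs tuning per source.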


1 Answer

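Here is a PyMuPDF example with a table whose column headers are rotated at different angles and include multi-line column names.

Some of the column names are written vertically.

The following PyMuPDF script finds and extracts the table, identifies the column names, and prints the table content in markdown format (GitHub-compatible):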

import fitz  # PyMuPDF
doc = fitz.open("input.pdf")  # test file
page = doc[0]  # first page, which contains the table
tabs = page.find_tables()  # find tables on the page
tab = tabs[0]  # take the first table
print(tab.to_markdown())  # print all content in GitHub markdown format

|Column1|column2|column3 line 2|column4 line 2|
|---|---|---|---|
|11|22|33|44|
|55|66|77|88|
|99|AA|BB|CC|
|DD|EE|FF||


tab.header.external  # show some table header properties
True

tab.header.names
['Column1', 'column2', 'column3 line 2', 'column4 line 2']
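Since the question also mentions tables that continue on another page, the header names can serve as a simple join key. The sketch below is not a PyMuPDF feature: merge_continued_tables is a hypothetical helper that takes (names, rows) pairs collected page by page and concatenates consecutive tables whose header names are identical. It assumes the header is repeated on every page; a continuation page without a repeated header would need a different rule, such as comparing column counts.

```python
def merge_continued_tables(tables):
    """Merge consecutive tables that share the same header names.

    tables: list of (names, rows) pairs in page order, where names is a
    list of column names and rows is a list of row lists.
    """
    merged = []
    for names, rows in tables:
        if merged and merged[-1][0] == list(names):
            # Same header as the previous table: treat it as a continuation
            merged[-1][1].extend(rows)
        else:
            # Copy so the caller's lists are not modified
            merged.append((list(names), list(rows)))
    return merged


pages = [
    (["Column1", "column2"], [["11", "22"], ["55", "66"]]),
    (["Column1", "column2"], [["99", "AA"]]),
]
for names, rows in merge_continued_tables(pages):
    print(names, len(rows))
```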

By the way: other output formats are available as well, such as a Python list or a pandas DataFrame.
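As a minimal sketch of those alternatives: tab.extract() returns the cell contents as a list of rows, and tab.to_pandas() returns a DataFrame directly. The helpers rows_to_records and first_table_records below are hypothetical names used only for illustration. The header_external flag reflects an assumption that, when the header is part of the table body, the first extracted row repeats it and should be skipped.

```python
def rows_to_records(names, rows, header_external=True):
    """Turn a list of rows into a list of dicts keyed by column name."""
    # Assumption: an internal header shows up as the first extracted row
    data = rows if header_external else rows[1:]
    return [dict(zip(names, row)) for row in data]


def first_table_records(pdf_path, page_index=0):
    """Extract the first table on a page as a list of dicts."""
    import fitz  # PyMuPDF; imported here so rows_to_records works without it

    with fitz.open(pdf_path) as doc:
        tabs = doc[page_index].find_tables()
        if not tabs.tables:
            return []  # no table detected on this page
        tab = tabs[0]
        return rows_to_records(tab.header.names, tab.extract(), tab.header.external)
```

If pandas is installed, tab.to_pandas() gives the same data as a DataFrame in a single call.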

Note: I am a maintainer and the original creator of PyMuPDF.

Answered 2025-04-12 by Python大师
