PyPDF2: extracting the table of contents/outline and its page numbers

[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'}, {'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'}, ... {'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, [{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, {'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'} ...
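The outline above is a nested list: each dict is one entry, and a list following an entry holds that entry's children. As a rough sketch (not a definitive PyPDF2 recipe), it can be flattened into (depth, title, page) rows with a small recursive walker. The names `walk_outline`, `page_number_of` and `print_outline` are made up for this illustration. The PyPDF2 calls in `print_outline` (`PdfReader`, `reader.outline`, `reader.get_destination_page_number`) assume PyPDF2 3.x; older releases use differently spelled names.

```python
from typing import Any, Callable, Iterator, List, Tuple


def walk_outline(
    outline: List[Any],
    page_number_of: Callable[[Any], Any],
    depth: int = 0,
) -> Iterator[Tuple[int, str, Any]]:
    """Yield (depth, title, page) for every entry of a nested outline."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            yield from walk_outline(item, page_number_of, depth + 1)
        else:
            yield depth, item["/Title"], page_number_of(item)


def print_outline(filepath: str) -> None:
    """Print an indented outline; assumes PyPDF2 3.x is installed."""
    # Imported lazily so walk_outline itself has no third-party dependency.
    from PyPDF2 import PdfReader

    reader = PdfReader(filepath)
    rows = walk_outline(reader.outline, reader.get_destination_page_number)
    for depth, title, page_index in rows:
        # page_index is whatever the reader reports for the destination
        # (a 0-based page index in the assumed PyPDF2 version).
        print("  " * depth + f"{title}: page index {page_index}")
```

Passing the page lookup in as a callable keeps the walker independent of the PDF library, so the same function works on plain dicts when testing.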

2 Answers

User

#1 · edited 2024-05-13 11:08:03

Martin Thoma's answer is exactly what I needed (PyMuPDF). Diblo Dk's answer is also an interesting workaround (PyPDF2).

The code I am quoting is Martin Thoma's:

from typing import Dict

import fitz  # pip install pymupdf


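def get_bookmarks(filepath: str) -> Dict[int, str]:
    """Return a mapping of page number to bookmark title."""
    # WARNING! One page can have multiple bookmarks; only the last one is kept.
    bookmarks = {}
    with fitz.open(filepath) as doc:
        # get_toc() is the current name; older PyMuPDF releases spell it getToC().
        toc = doc.get_toc()  # [[lvl, title, page], ...]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks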


print(get_bookmarks("my.pdf"))
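As the warning in the code notes, a dict keyed by page keeps only one title per page. A minimal sketch of a variant that keeps every title is shown here. The helper names `group_bookmarks_by_page` and `get_all_bookmarks` are made up for this example, and it assumes a PyMuPDF version that provides `Document.get_toc()`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_bookmarks_by_page(toc: Sequence[Sequence]) -> Dict[int, List[str]]:
    """Map each page number to all bookmark titles that point at it."""
    bookmarks: Dict[int, List[str]] = defaultdict(list)
    # Each entry starts with level, title, page; any extra fields are ignored.
    for _level, title, page, *_rest in toc:
        bookmarks[page].append(title)
    return dict(bookmarks)


def get_all_bookmarks(filepath: str) -> Dict[int, List[str]]:
    """Read the table of contents from a PDF and group titles by page."""
    # Imported lazily so the grouping helper has no third-party dependency.
    import fitz  # pip install pymupdf

    with fitz.open(filepath) as doc:
        return group_bookmarks_by_page(doc.get_toc())
```

Titles stay in document order within each page, so a chapter heading and its first subsection on the same page appear in that order.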

User

#2 · edited 2024-05-13 11:08:03

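Check out the package named tabula. Extracting tables is very easy with it. The package also provides options that let you extract content from tables that span multiple pages.

Here is a link worth checking out: https://towardsdatascience.com/scraping-table-data-from-pdf-files-using-a-single-line-in-python-8607880c750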
