How do I extract text and its coordinates from a PDF file?

1

This is easy to do with PyMuPDF. See this link for details: https://pymupdf.readthedocs.io/en/latest/app1.html

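import fitz  # PyMuPDF

# path_to_pdf_file is assumed to hold the path to your PDF
with fitz.open(path_to_pdf_file) as document:
    words_dict = {}
    for page_number, page in enumerate(document):
        # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no)
        words = page.get_text("words")
        words_dict[page_number] = words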
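
Once you have the word tuples, you will often want whole lines rather than single words. Here is a minimal sketch of one way to do that, assuming the "words" tuple layout (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper name words_to_lines and the sample values are made up for illustration; in practice you would pass in one page's list from words_dict.

```python
from collections import defaultdict

# Made-up sample in the "words" tuple layout:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
sample_words = [
    (100.0, 73.0, 130.0, 97.0, "Hello", 0, 0, 0),
    (135.0, 73.0, 161.0, 97.0, "World", 0, 0, 1),
    (100.0, 173.0, 140.0, 197.0, "Second", 0, 1, 0),
    (145.0, 173.0, 170.0, 197.0, "line", 0, 1, 1),
]


def words_to_lines(words):
    """Group word tuples by (block_no, line_no); return (bbox, text) pairs."""
    groups = defaultdict(list)
    for w in words:
        groups[(w[5], w[6])].append(w)

    lines = []
    for key in sorted(groups):
        # Order the words inside a line by word_no
        members = sorted(groups[key], key=lambda w: w[7])
        # The line's bbox is the union of its word bboxes
        bbox = (
            min(w[0] for w in members),
            min(w[1] for w in members),
            max(w[2] for w in members),
            max(w[3] for w in members),
        )
        lines.append((bbox, " ".join(w[4] for w in members)))
    return lines


for bbox, text in words_to_lines(sample_words):
    print(bbox, text)
```

Note that PyMuPDF coordinates have their origin at the top-left of the page, so a smaller y means higher up on the page.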

Answered 2025-04-18 by Python大师

50

Newlines are converted to underscores in the final output. This is the minimal working solution that I found.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))

        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()

    # extract text from this object
    parse_obj(layout._objs)

Answered 2025-04-18 by Python大师

59

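Here is a copy-and-paste-ready example that lists the top-left corner of every block of text in a PDF. I think it should work for any PDF that doesn't contain "Form XObjects" with text in them.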

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open('yourpdf.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('Processing next page...')
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))

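The code above is based on the Layout Analysis example in the PDFMiner documentation, plus the examples by pnj (https://stackoverflow.com/a/22898159/1709587) and Matt Swain (https://stackoverflow.com/a/25262470/1709587). I've made a couple of changes from those earlier examples:

I use PDFPage.get_pages(), which is shorthand for creating a document, checking that it allows text extraction, and passing it to PDFPage.create_pages().
I don't handle LTFigures, since PDFMiner is currently unable to cleanly handle the text inside them.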

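LAParams lets you set some parameters that control how individual characters in the PDF get magically grouped into lines and text boxes by PDFMiner. If you're surprised that such grouping needs to happen at all, there is a reason for it, explained in the pdf2txt docs:

In an actual PDF file, text portions might be split into several chunks, depending on the authoring software. Therefore, text extraction needs to splice these text chunks together.

The parameters of LAParams are, like most of PDFMiner, not documented in detail, but you can read them in the source code or call help(LAParams) in your Python shell. The meaning of some of the parameters is given at https://pdfminer-docs.readthedocs.io/pdfminer_index.html#pdf2txt-py, since they can also be passed as command-line arguments to pdf2txt.

The layout object above is an LTPage, which is an iterable of "layout objects". Each of these layout objects can be one of the following types...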

LTTextBox
LTFigure
LTImage
LTLine
LTRect

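... or their subclasses. (In particular, your text boxes will probably all be LTTextBoxHorizontals.)

More detail on the structure of an LTPage is shown by this image from the docs:

[Image: tree diagram of the structure of an LTPage. Of relevance to this answer: it shows that an LTPage contains the 5 types listed above, that an LTTextBox contains LTTextLines plus unspecified other items, and that an LTTextLine contains LTChars, LTAnnos, LTTexts, and unspecified other items.]

Each of the types above has a .bbox property holding an (x0, y0, x1, y1) tuple with the coordinates of the left, bottom, right, and top of the object, respectively. The y-coordinates are measured from the bottom of the page. If you'd rather work with a y-axis that runs from top to bottom, you can subtract them from the height of the page's .mediabox: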

x0, y0_orig, x1, y1_orig = some_lobj.bbox
y0 = page.mediabox[3] - y1_orig
y1 = page.mediabox[3] - y0_orig
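
To make the flip reusable, the same arithmetic can be wrapped in a small function. This is only a sketch: the name to_top_left_origin is made up, and it assumes the page's lower edge sits at y = 0 so that mediabox[3] equals the page height. The sample box and the 792-point page height are illustrative values.

```python
def to_top_left_origin(bbox, page_height):
    """Convert an (x0, y0, x1, y1) bbox measured from the bottom of the page
    into one measured from the top of the page."""
    x0, y0_orig, x1, y1_orig = bbox
    # The old top edge becomes the new (smaller) y0, and vice versa
    return (x0, page_height - y1_orig, x1, page_height - y0_orig)


# A text box near the top of a 792-point-high page
print(to_top_left_origin((100, 695, 161, 719), 792))  # (100, 73, 161, 97)
```

After the conversion y0 is still smaller than y1, so code that expects a top-left origin can use the tuple directly.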

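In addition to a bbox, LTTextBoxes also have a .get_text() method, shown above, that returns their text content as a string. Note that each LTTextBox is made up, through its LTTextLines, of LTChars (characters explicitly drawn by the PDF, each with a bbox) and LTAnnos (extra spaces that PDFMiner adds to the string representation of the text box's content because the characters were drawn far apart; these have no bbox).

The code example at the beginning of this answer combines these two properties to show the coordinates of each block of text.

Finally, it's worth noting that, unlike the other Stack Overflow answers cited above, I don't recurse into LTFigures. Although LTFigures can contain text, PDFMiner doesn't seem capable of grouping that text into LTTextBoxes (you can try it yourself on the example PDF from https://stackoverflow.com/a/27104504/1709587) and instead produces an LTFigure that directly contains LTChar objects. In principle you could figure out how to piece these together into a string, but PDFMiner (as of version 20181108) can't do it for you.

Hopefully, though, the PDFs you need to parse don't use Form XObjects with text in them, so this caveat won't apply to you.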
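
If you do end up with loose characters inside an LTFigure, here is a rough sketch of the kind of stitching you would have to write yourself. It is not PDFMiner's layout algorithm, just a naive heuristic: characters whose bottom edges are within line_tolerance of each other are treated as one line, and a horizontal gap larger than space_gap becomes a space. The function name, both thresholds, and the sample characters are assumptions made for illustration. It works on plain (text, x0, y0, x1, y1) tuples, which you could presumably build from each LTChar's get_text() and bbox.

```python
def stitch_chars(chars, line_tolerance=2.0, space_gap=1.0):
    """Group (text, x0, y0, x1, y1) tuples into lines of text.

    Returns a list of (bbox, string) pairs, topmost line first.
    Coordinates use a bottom-left origin, as in PDFMiner.
    """
    lines = []  # each entry is a list of chars sharing roughly one bottom edge
    for ch in sorted(chars, key=lambda c: -c[2]):  # top of the page first
        if lines and abs(lines[-1][0][2] - ch[2]) <= line_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])

    result = []
    for line in lines:
        line.sort(key=lambda c: c[1])  # left to right
        text = line[0][0]
        for prev, cur in zip(line, line[1:]):
            if cur[1] - prev[3] > space_gap:  # big gap: treat as a space
                text += " "
            text += cur[0]
        bbox = (
            min(c[1] for c in line),
            min(c[2] for c in line),
            max(c[3] for c in line),
            max(c[4] for c in line),
        )
        result.append((bbox, text))
    return result


# Made-up, deliberately shuffled sample characters
sample_chars = [
    ("W", 261, 695, 284, 719), ("H", 100, 695, 117, 719),
    ("i", 117, 595, 122, 619), ("e", 117, 695, 131, 719),
    ("l", 131, 695, 136, 719), ("H", 100, 595, 117, 619),
    ("l", 136, 695, 141, 719), ("o", 141, 695, 155, 719),
    ("o", 284, 695, 297, 719), ("r", 297, 695, 305, 719),
    ("l", 305, 695, 311, 719), ("d", 311, 695, 324, 719),
]

for bbox, text in stitch_chars(sample_chars):
    print(bbox, text)
```

Real PDFs with rotated text, superscripts, or multiple columns would need something smarter than this.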

Answered 2025-04-18 by Python大师

27

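Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

Nowadays, pdfminer.six has multiple APIs for extracting text and information from a PDF. For extracting information programmatically, I would advise using extract_pages(). It lets you inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a Pythonic way of showing all the elements in the hierarchy. It uses simple1.pdf from the samples directory of pdfminer.six.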

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        x1  y1  x2  y2   text')
        print('------------------------------ --- --- --- ---- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_bbox(o)} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of LTItem"""
    return '  ' * depth + o.__class__.__name__


def get_optional_bbox(o: Any) -> str:
    """Bounding box of LTItem if available, otherwise empty string"""
    if hasattr(o, 'bbox'):
        return ''.join(f'{i:<4.0f}' for i in o.bbox)
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()

pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the different elements in the hierarchy, the bounding box of each, and the text that the element contains.

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
generator                       
  LTPage                       0   0   612 792  
    LTTextBoxHorizontal        100 695 161 719  Hello
      LTTextLineHorizontal     100 695 161 719  Hello
        LTChar                 100 695 117 719  H
        LTChar                 117 695 131 719  e
        LTChar                 131 695 136 719  l
        LTChar                 136 695 141 719  l
        LTChar                 141 695 155 719  o
        LTChar                 155 695 161 719  
        LTAnno                  
    LTTextBoxHorizontal        261 695 324 719  World
      LTTextLineHorizontal     261 695 324 719  World
        LTChar                 261 695 284 719  W
        LTChar                 284 695 297 719  o
        LTChar                 297 695 305 719  r
        LTChar                 305 695 311 719  l
        LTChar                 311 695 324 719  d
        LTAnno                  
    LTTextBoxHorizontal        100 595 161 619  Hello
      LTTextLineHorizontal     100 595 161 619  Hello
        LTChar                 100 595 117 619  H
        LTChar                 117 595 131 619  e
        LTChar                 131 595 136 619  l
        LTChar                 136 595 141 619  l
        LTChar                 141 595 155 619  o
        LTChar                 155 595 161 619  
        LTAnno                  
    LTTextBoxHorizontal        261 595 324 619  World
      LTTextLineHorizontal     261 595 324 619  World
        LTChar                 261 595 284 619  W
        LTChar                 284 595 297 619  o
        LTChar                 297 595 305 619  r
        LTChar                 305 595 311 619  l
        LTChar                 311 595 324 619  d
        LTAnno                  
    LTTextBoxHorizontal        100 495 211 519  H e l l o
      LTTextLineHorizontal     100 495 211 519  H e l l o
        LTChar                 100 495 117 519  H
        LTAnno                  
        LTChar                 127 495 141 519  e
        LTAnno                  
        LTChar                 151 495 156 519  l
        LTAnno                  
        LTChar                 166 495 171 519  l
        LTAnno                  
        LTChar                 181 495 195 519  o
        LTAnno                  
        LTChar                 205 495 211 519  
        LTAnno                  
    LTTextBoxHorizontal        321 495 424 519  W o r l d
      LTTextLineHorizontal     321 495 424 519  W o r l d
        LTChar                 321 495 344 519  W
        LTAnno                  
        LTChar                 354 495 367 519  o
        LTAnno                  
        LTChar                 377 495 385 519  r
        LTAnno                  
        LTChar                 395 495 401 519  l
        LTAnno                  
        LTChar                 411 495 424 519  d
        LTAnno                  
    LTTextBoxHorizontal        100 395 211 419  H e l l o
      LTTextLineHorizontal     100 395 211 419  H e l l o
        LTChar                 100 395 117 419  H
        LTAnno                  
        LTChar                 127 395 141 419  e
        LTAnno                  
        LTChar                 151 395 156 419  l
        LTAnno                  
        LTChar                 166 395 171 419  l
        LTAnno                  
        LTChar                 181 395 195 419  o
        LTAnno                  
        LTChar                 205 395 211 419  
        LTAnno                  
    LTTextBoxHorizontal        321 395 424 419  W o r l d
      LTTextLineHorizontal     321 395 424 419  W o r l d
        LTChar                 321 395 344 419  W
        LTAnno                  
        LTChar                 354 395 367 419  o
        LTAnno                  
        LTChar                 377 395 385 419  r
        LTAnno                  
        LTChar                 395 395 401 419  l
        LTAnno                  
        LTChar                 410 395 424 419  d
        LTAnno
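
If you only need the text boxes and their coordinates rather than the full printout, the same recursive walk can collect rows instead of printing them. The sketch below is an assumption-laden illustration: iter_text_items is a made-up helper, and the two small classes are stand-ins that only mimic the shape of pdfminer.six layout objects (a bbox, a get_text() method, iterable children) so that the snippet runs without a PDF. With the real library you would pass the result of extract_pages() instead.

```python
from collections.abc import Iterable


def iter_text_items(item, wanted_prefix="LTTextBox"):
    """Yield (class_name, bbox, text) for every matching item in a layout tree."""
    name = item.__class__.__name__
    if name.startswith(wanted_prefix) and hasattr(item, "get_text"):
        yield name, tuple(item.bbox), item.get_text().strip()
    # Recurse into containers; strings are iterable too, so skip them
    if isinstance(item, Iterable) and not isinstance(item, str):
        for child in item:
            yield from iter_text_items(child, wanted_prefix)


# Stand-in classes for illustration only, NOT the real pdfminer.six classes
class LTPage(list):
    bbox = (0, 0, 612, 792)


class LTTextBoxHorizontal(list):
    def __init__(self, bbox, text):
        super().__init__()
        self.bbox = bbox
        self._text = text

    def get_text(self):
        return self._text


page = LTPage([
    LTTextBoxHorizontal((100, 695, 161, 719), "Hello \n"),
    LTTextBoxHorizontal((261, 695, 324, 719), "World\n"),
])

for row in iter_text_items([page]):
    print(row)
```

Changing wanted_prefix to "LTTextLine" or "LTChar" would, under the same assumptions, give you line-level or character-level coordinates.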

(Similar answers can be found here, here and here; I'll try to keep them in sync.)

Answered 2025-04-18 by Python大师
