PDFMiner: extracting individual words from LTText / LTTextBox

from pdfminer.layout import LAParams, LTTextBox, LTText
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

# Imports searchable PDFs and prints x, y coordinates
# Raw string so the backslash in the Windows path is taken literally
fp = open(r'C:\sample.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    for lobj in layout:
        if isinstance(lobj, LTText):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))

--- Processing ---
At (57.375, 747.903) is text: A Simple PDF File
At (69.25, 698.098) is text: This is a small demonstration .pdf file -
At (69.25, 674.194) is text: just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text.

1 Answer

#1 · Posted 2024-04-25 14:51:44

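With PDFMiner, once you have iterated over each line (as you already do), the only level left to iterate over is the individual characters in that line.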

I did this with the code below. It records the x, y of the first character of each word, and splits words at every LTAnno (e.g. \n) or wherever .get_text() == ' ' (a space).

from pdfminer.layout import LAParams, LTTextBox, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

# Reads a searchable PDF and prints the x, y coordinates of each word
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)

# Raw string so the backslash in the Windows path is taken literally
with open(r'C:\sample.pdf', 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        print('--- Processing ---')
        interpreter.process_page(page)
        layout = dev.get_result()
        # x == -1 means no word is currently being collected
        x, y, text = -1, -1, ''
        for textbox in layout:
            # Only text boxes contain lines of characters
            if isinstance(textbox, LTTextBox):
                for line in textbox:
                    for char in line:
                        # A line break (LTAnno) or a space completes the word
                        if isinstance(char, LTAnno) or char.get_text() == ' ':
                            if x != -1:
                                print('At %r is text: %s' % ((x, y), text))
                            x, y, text = -1, -1, ''
                        elif isinstance(char, LTChar):
                            text += char.get_text()
                            if x == -1:
                                # Top-left corner of the word's first character
                                x, y = char.bbox[0], char.bbox[3]
        # If the page ended without a space or an LTAnno, print the pending word
        if x != -1:
            print('At %r is text: %s' % ((x, y), text))

The output looks like this:

At (64.881, 747.903) is text: A
At (90.396, 747.903) is text: Simple
At (180.414, 747.903) is text: PDF
At (241.92, 747.903) is text: File
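
The splitting rule used above can also be tried without a PDF at hand. The following is only a sketch: split_words and the (text, bbox) tuples are hypothetical stand-ins, not part of PDFMiner, where a bbox of None plays the role of an LTAnno and any other item plays the role of an LTChar.

```python
from typing import Iterable, Iterator, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), same order as pdfminer
Item = Tuple[str, Optional[BBox]]         # hypothetical stand-in for LTChar / LTAnno


def split_words(items: Iterable[Item]) -> Iterator[Tuple[float, float, str]]:
    """Yield (x, y, word), using the top-left corner of each word's first character."""
    start = None  # (x, y) of the current word's first character, None if no word is open
    text = ''
    for char, bbox in items:
        # bbox is None for LTAnno-like items; those and plain spaces both end a word
        if bbox is None or char == ' ':
            if start is not None:
                yield start[0], start[1], text
            start, text = None, ''
        else:
            if start is None:
                start = (bbox[0], bbox[3])
            text += char
    # Flush a word that was not followed by a separator
    if start is not None:
        yield start[0], start[1], text


# Made-up coordinates, purely for illustration
items = [
    ('H', (10.0, 90.0, 15.0, 100.0)),
    ('i', (15.0, 90.0, 18.0, 100.0)),
    (' ', (18.0, 90.0, 21.0, 100.0)),
    ('y', (21.0, 90.0, 26.0, 100.0)),
    ('o', (26.0, 90.0, 31.0, 100.0)),
    ('\n', None),
]
words = list(split_words(items))
for x, y, word in words:
    print('At %r is text: %s' % ((x, y), word))
```

Using None instead of -1 as the "no word open" marker avoids treating a real coordinate as a sentinel, but that is a design preference rather than something the original code requires.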

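You may want to refine the condition so that it detects words in the way that suits your needs (for example, stripping punctuation such as . ! ? from the end of a word).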
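
One possible way to strip trailing punctuation is sketched below. The helper name strip_trailing_punctuation is made up for illustration, and it assumes ASCII punctuation only (Python's string.punctuation); other scripts would need their own character set.

```python
import string


def strip_trailing_punctuation(word: str) -> str:
    """Remove ASCII punctuation from the end of a word, leaving the rest untouched."""
    return word.rstrip(string.punctuation)


for raw in ['tutorials.', 'text!?', '.pdf', '-']:
    print(repr(raw), '->', repr(strip_trailing_punctuation(raw)))
```

A word made up entirely of punctuation, such as the lone "-" in the sample PDF, becomes an empty string, so the caller would probably want to skip empty results before printing.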
