按页阅读pdf页面

2024-05-15 22:12:09 发布

您现在位置:Python中文网/ 问答频道 /正文

我寻找我的问题,但没有在两个可用的问题中得到答案

  1. Extract text per page with Python pdfMiner?

  2. PDFMiner - Iterating through pages and converting them to text

基本上,我想遍历每个页面,因为我只想选择具有特定文本的页面。

我用过pyPdf。它几乎可以为我所说的90%的pdfs工作,但有时它不能从页面中提取信息。

我使用了以下代码:

import pyPdf
extract = ""        
pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
num_of_pages = pdf.getNumPages()
for p in range(num_of_pages):
  ex = pdf.getPage(6)
  ex = ex.extractText()
  if re.search(r"to be held (at|on)",ex.lower()):
    print 'yes'
    print  ex ,"\n"
    extract = extract + ex + "\n" 
    continue

上面的代码可以工作,但有时有些页面无法提取。

我也尝试过使用pdfminer,但是我找不到如何逐页迭代pdf。pdfminer返回pdf的整个文本。

我使用了以下代码:

def convert_pdf_to_txt(path):
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  fp = file(path, 'rb')
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  pagenos=set()

 for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

   fp.close()
   device.close()
   retstr.close()
   return text

在上面的代码中,pdf的文本来自for循环

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

在这个例子中,我如何一次迭代一个页面。

关于pdfminer的文档是不可理解的。同样也有很多版本。

那么,有没有其他的软件包可以用于我的问题,或者pdfminer可以用于它?


Tags: to代码textforpdfpagepassword页面
2条回答

我知道回答你自己的问题不好,但我想我可能已经找到了这个问题的答案。

我认为这不是最好的方法,但它仍然帮助我。

我使用了pypdfpdfminer的组合

代码如下:

import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue

也许有更好的方法,但目前我发现这是相当好的。

因为retstr将保留每个页面,所以您可以考虑通过调用retstr.truncate(0)来更改代码,该函数每次都清除字符串,否则您将打印每次已读内容的全部内容:

import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    retstr.truncate(0)
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue

相关问题 更多 >