Python, pyPdf OCR error: pyPdf.utils.PdfReadError: EOF marker not found

Question

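pyPdf is raising this exception:

pyPdf.utils.PdfReadError: EOF marker not found

I don't need to fix pyPdf. I just want the EOF error to trigger an "except" block so I can skip the file, but that isn't working: the program still stops.

Background:

A batch OCR program for PDFs

Python, pyPdf, Adobe PDF OCR error: unsupported filter /lzwdecode

... and the saga continues.

I have 10,000 PDF files in a folder. Some have been OCRed and some have not, and there is no way to tell them apart. The first step is to find out which ones have not been OCRed, and then OCR only those (see the other thread for details).

So I'm using pyPdf. When I try to read the text, I get exceptions related to unrecognized characters and unsupported filters. So I figured that if it throws an exception, the file has some text in it, and I leave it out of the list of files to OCR. Problem solved, right? Like this: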

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    path = r'C:\Users\Homer\Documents\My Pdfs'

    filelist = os.listdir(path)

    has_text_list = []
    does_not_have_text_list = []

    for pdf_name in filelist:
        pdf_file_with_directory = os.path.join(path, pdf_name)
        pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))
        print pdf_name
        for i in range(0, pdf.getNumPages()):
            try:
                pdf.write("%%EOF")
                content = pdf.getPage(i).extractText()
                does_it_have_text = re.findall(r'\w{2,}', content)
                if does_it_have_text == []:
                    does_not_have_text_list.append(pdf_name)
                    print pdf_name
                else:
                    has_text_list.append(pdf_name)
            except:
                has_text_list.append(pdf_name)

    print does_not_have_text_list

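But then I got this error:

pyPdf.utils.PdfReadError: EOF marker not found

It seems to come up a lot (judging from a Google search):

http://pdfposter.origo.ethz.ch/node/31

I think this means pyPdf opened the file, tried to process the text, threw some exception, ran the except: block, and now can't move on to the next step because it doesn't know the file has ended.

There are other threads like this where people claim the problem was fixed, but it doesn't appear to be.

Then someone here wrote a function that first writes an EOF marker into the .pdf file.

http://code.activestate.com/lists/python-list/589529/

I tried adding the line "pdf.write("%%EOF")" to imitate that approach, but it didn't work.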
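
For reference, a PDF file is expected to end with a `%%EOF` marker, and its presence can be checked on the file itself before the file is handed to pyPdf. The sketch below is not the function from the linked post; it is a read-only variant, under the assumption that looking at the last 1024 bytes of the file is enough. The name `has_eof_marker` and the `tail_size` parameter are made up for illustration.

```python
import os


def has_eof_marker(path, tail_size=1024):
    """Return True if the last `tail_size` bytes of the file contain %%EOF.

    Hypothetical helper: the 1024-byte window is an assumption for this
    sketch, not a value taken from pyPdf.
    """
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Jump back at most `tail_size` bytes from the end of the file.
        f.seek(max(0, size - tail_size))
        return b'%%EOF' in f.read()
```

Files for which this returns False could be set aside without ever being opened by pyPdf. Note also that `pdf` in the script is a `PdfFileReader`; in pyPdf, `write` belongs to `PdfFileWriter`, so the `pdf.write("%%EOF")` line most likely raises an `AttributeError` that the bare `except:` then swallows.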

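So how do I get this error to run the except block? I'm also using Wing IDE, so if there's a way to skip these files with the debugger, that would be fine too. Thanks.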
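
One likely explanation, offered as an assumption rather than a confirmed diagnosis: the error is raised while `pyPdf.PdfFileReader(...)` is being constructed, and in the script that line sits outside the `try`, so the `except` never sees it. A minimal sketch of moving the open-and-parse step inside the `try` follows. `split_readable` and its `load` parameter are hypothetical names; `load` stands for whatever opens one file.

```python
def split_readable(paths, load):
    """Call load(path) for each path and return (loaded, skipped).

    `load` is any callable that opens and parses one file. Both names
    are hypothetical and exist only for this sketch.
    """
    loaded, skipped = [], []
    for path in paths:
        try:
            # The parse step lives inside the try, so an error raised
            # while opening or parsing the file is caught below.
            loaded.append((path, load(path)))
        except Exception as exc:
            # Remember the file and the error, then move on to the next one.
            skipped.append((path, exc))
    return loaded, skipped
```

With pyPdf this might be called as `split_readable(paths, lambda p: pyPdf.PdfFileReader(open(p, 'rb')))`. The entries in `skipped` can then be logged or checked by hand instead of stopping the whole batch.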


1 Answer
