"AttributeError: 'NoneType' object has no attribute 'getobj'" when using the PDFMiner library

Asked 2025-04-17 16:09

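I'm writing a script that uploads PDF files and parses them along the way. I'm using PDFMiner for the parsing.

To turn the file into a document PDFMiner can work with, I use the following function, written by following the instructions at the link above: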

def load_document(self, _file = None):
    """turn the file into a PDFMiner document"""
    if _file == None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    doc.set_parser(parser)
    if self.options['password']:
        password = self.options['password']
    else:
        password = ""
    doc.initialize(password)
    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc

I was hoping to get a nice PDFDocument instance, but instead I get this error:

Traceback (most recent call last):
  File "bzk_pdf.py", line 45, in <module>
    cli.run_cli(BZKPDFScraper)
  File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
    instance = cls(options)
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
    self.doc = self.load_document()
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
    doc.set_parser(parser)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
AttributeError: 'NoneType' object has no attribute 'getobj'

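I don't know where to start looking, and I haven't found anyone else with the same problem.

Some additional information that may help:

My test file: http://www.2shared.com/document/kM_wrI3J/testpdf.html
_file is a Django file object, but a regular file gives the same result
pdfminer version: 'pdfminer-20110515'
Django: 1.4.3 (probably not relevant)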
Python 2.7.3


2 Answers

Try opening the file yourself and passing the open file object to the parser, like this:

with open(_file,'rb') as f:
    parser = PDFParser(f)
    # your normal code here

With your current approach, I suspect you are passing the file name as a string.
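
If the function has to cope with both cases (a path on disk or an already-open upload object), a small wrapper can normalize the input before it reaches the parser. This is only a sketch: `open_binary` is a hypothetical helper, not part of PDFMiner, and it is written for Python 3 rather than the Python 2.7 used in the question.

```python
import io
import os
import tempfile


def open_binary(source):
    """Return a binary-mode file object for `source`.

    Hypothetical helper (not part of PDFMiner): accepts either a filesystem
    path or an object that already has a read() method, such as an uploaded
    file wrapper.
    """
    if isinstance(source, (str, os.PathLike)):
        # A path: open it ourselves, in binary mode because PDFs are binary.
        return open(source, "rb")
    if hasattr(source, "read"):
        # Already file-like: hand it through untouched.
        return source
    raise TypeError("expected a path or a file-like object, got %r" % type(source))


if __name__ == "__main__":
    # Write a few placeholder bytes to a temporary file for the demo.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(b"%PDF-1.4 placeholder")
    try:
        with open_binary(tmp.name) as f:
            print(f.read(8))  # opened from a path
        print(open_binary(io.BytesIO(b"%PDF-1.4")).read(8))  # passed through
    finally:
        os.remove(tmp.name)
```

The result of `open_binary(...)` is what you would then hand to `PDFParser`. Note that the caller remains responsible for closing a file that was opened from a path.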

Answered 2025-04-17 by Python大师

After some experimenting, I found that I was missing one line:

parser.set_document(doc)

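With that line added, the function now works.

I think this may be poor library design, but it is also possible that I am missing something and this only patches over the error.

Either way, I now have a PDF document containing the data I need.

Here is the final result: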

def load_document(self, _file=None):
    """Turn the file into a PDFMiner document."""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    # Link parser and document in both directions. Without the first call,
    # doc.set_parser() raises the AttributeError shown in the question.
    parser.set_document(doc)
    doc.set_parser(parser)

    # Fall back to an empty password if none was supplied.
    if 'password' in self.options:
        password = self.options['password']
    else:
        password = ""

    doc.initialize(password)

    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc
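
My reading of the traceback is that the object reference for the trailer's 'Info' entry gets its document from the parser, so if the parser was never told about the document, that reference holds None. I have not verified this against the pdfminer-20110515 source beyond what the traceback shows. The toy classes below (`ToyParser`, `ToyDocument`, `ToyObjRef`, all hypothetical names) only mimic that assumed relationship in Python 3 to show how the same error arises:

```python
class ToyObjRef:
    """Stand-in for an indirect object reference found while parsing."""

    def __init__(self, doc, objid):
        self.doc = doc  # whatever document the parser knew about, possibly None
        self.objid = objid

    def resolve(self):
        # Raises AttributeError if self.doc is None.
        return self.doc.getobj(self.objid)


class ToyParser:
    def __init__(self):
        self.doc = None  # no document until set_document() is called

    def set_document(self, doc):
        self.doc = doc

    def read_trailer(self):
        # The trailer points at the Info dictionary indirectly.
        return {"Info": ToyObjRef(self.doc, 1)}


class ToyDocument:
    def __init__(self):
        self.objects = {1: {"Title": "test"}}
        self.info = []

    def getobj(self, objid):
        return self.objects[objid]

    def set_parser(self, parser):
        trailer = parser.read_trailer()
        self.info.append(trailer["Info"].resolve())


def load(link_parser_to_doc):
    """Build a toy document, optionally skipping parser.set_document()."""
    parser = ToyParser()
    doc = ToyDocument()
    if link_parser_to_doc:
        parser.set_document(doc)
    doc.set_parser(parser)
    return doc


if __name__ == "__main__":
    print(load(True).info)
    try:
        load(False)
    except AttributeError as exc:
        print("AttributeError:", exc)
```

If that reading is right, the order matters too: `parser.set_document(doc)` has to come before `doc.set_parser(parser)`, as in the function above.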

Answered 2025-04-17 by Python大师
