Filling in an encrypted PDF form with Python

Posted 2024-04-20 05:11:47

My project is to automatically fill in the PDF form that Deutsche Bahn (the German railway company) provides for delayed trains: https://www.bahn.de/wmedia/view/mdb/media/intern/fahrgastrechteformular.pdf

When you open the link in Google Chrome, you can easily edit the file, so it should be possible to do the same in Python.

I have tried several approaches:

1. Using PyPDF2

I used the approach suggested in the second answer to this Stack Overflow question: Batch fill PDF forms from python or bash

# -*- coding: utf-8 -*-

from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


def get_form_fields(infile):
    infile = PdfFileReader(open(infile, 'rb'))
    fields = _getFields(infile)
    return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = '2PagesFormExample.pdf'

    pprint(get_form_fields(pdf_file_name))

However, the program fails because the PDF has not been decrypted:

  File "c:\Users\User1\iCloudDrive\fahrgastrechte\fahrgastrechte.py", line 94, in <module>
    pprint(get_form_fields(pdf_file_name))
  File "c:\Users\User1\iCloudDrive\fahrgastrechte\fahrgastrechte.py", line 62, in get_form_fields
    fields = _getFields(infile)
  File "c:\Users\User1\iCloudDrive\fahrgastrechte\fahrgastrechte.py", line 32, in _getFields
    catalog = obj.trailer["/Root"]
  File "C:\Program Files\Python36\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Program Files\Python36\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Program Files\Python36\lib\site-packages\PyPDF2\pdf.py", line 1617, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted

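I don't understand why decryption is necessary, since for now I only want to read the data. I could understand it if I were writing data. Yet writing into the PDF's fields is clearly possible too, for example with Google Chrome.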
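
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the form may be protected only by an owner password while the user password is empty, which would explain why Chrome opens it without asking. PyPDF2's legacy `PdfFileReader` exposes `isEncrypted` and `decrypt()`, so a minimal sketch is to try the empty password before reading the fields. The helper names `unlock` and `read_field_values` are hypothetical, and `decrypt()` may still raise `NotImplementedError` if the file uses an encryption algorithm this PyPDF2 version does not support.

```python
def unlock(reader, password=""):
    """Return True if `reader` is readable after trying `password`.

    `reader` is expected to behave like PyPDF2's legacy PdfFileReader:
    an `isEncrypted` attribute and a `decrypt(password)` method that
    returns 0 on failure and a non-zero value on success.
    """
    if not reader.isEncrypted:
        return True  # nothing to decrypt
    return reader.decrypt(password) != 0


def read_field_values(path, password=""):
    """Hypothetical helper: map field names to their current values."""
    # Imported lazily so unlock() can be exercised without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as fh:
        reader = PdfFileReader(fh)
        if not unlock(reader, password):
            raise ValueError("the PDF rejected the supplied password")
        fields = reader.getFields() or {}  # getFields() returns None without a form
        # Resolve the values while the file is still open.
        return {name: field.get("/V", "") for name, field in fields.items()}
```

If the empty password is rejected, the file needs a real user password and this sketch does not help; it also only reads values and does not write them back.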

2. Using pypdftk

To start with, I only wanted to read the form's data:

import pypdftk

pdf_file_name = './fahrgastrechteformular.pdf'
data = pypdftk.dump_data_fields(pdf_file_name)

At the moment my system (Windows 10) cannot find the pdftk.exe file that the Python module calls, so I ran it directly in bash:

pdftk.exe fahrgastrechteformular.pdf dum_data_fields

Here I also got an encryption error:

Error: Failed to open PDF file:
   fahrgastrechteformular.pdf
   OWNER PASSWORD REQUIRED, but not given (or incorrect)
Error: Unable to find file.
Error: Failed to open PDF file:
   dum_data_fields
Done.  Input errors, so no output created.
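
For reference, a minimal sketch of calling pdftk from Python without going through pypdftk, assuming the executable can be found on `PATH` (which is exactly what fails on my machine). It uses pdftk's `dump_data_fields` operation and parses the `Key: Value` blocks separated by `---` that pdftk prints. The function names are hypothetical, and the sketch does nothing about the owner password: pdftk would still refuse this particular file.

```python
import shutil
import subprocess


def build_dump_command(pdf_path, pdftk="pdftk"):
    """Argument list for `pdftk <file> dump_data_fields`."""
    return [pdftk, pdf_path, "dump_data_fields"]


def parse_data_fields(text):
    """Parse dump_data_fields output into a list of dicts, one per field.

    Repeated keys within one field (e.g. FieldStateOption) keep only
    the last value in this simplified version.
    """
    fields, current = [], {}
    for line in text.splitlines():
        if line.strip() == "---":  # separator that starts a new field
            if current:
                fields.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        fields.append(current)
    return fields


def dump_data_fields(pdf_path):
    """Run pdftk and return the parsed fields; raises if pdftk is missing."""
    exe = shutil.which("pdftk")
    if exe is None:
        raise FileNotFoundError("pdftk was not found on PATH")
    result = subprocess.run(
        build_dump_command(pdf_path, exe),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_data_fields(result.stdout)
```

Keeping the command builder and the parser separate from the subprocess call means both can be checked on a machine where pdftk is not installed.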

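So, to begin with, I only want to read the PDF's form fields. For example, when I fill in the first field with "Berlin Central Station" in Google Chrome, I want to read that value with the Python script shown above. The next step would be to actually edit the field contents. I hope you can follow; please ask if anything is unclear.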

