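pdftotree Python package - PyPI

Parses a PDF into an HTML-like tree.

pdftotree project description

Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables. A crucial step in this process is building the hierarchical tree of context objects such as text blocks, figures, and tables. The system currently relies on the PDF-to-HTML conversion provided by Adobe Acrobat. However, Adobe Acrobat is not an open-source tool, which may be inconvenient for Fonduer users.

This package is the result of building our own module to replace Adobe Acrobat. Several open-source tools are available for PDF-to-HTML conversion, but they do not preserve the cell structure of tables. The goal of this project is to develop a tool that extracts text, figures, and tables from a PDF document and maintains the structure of the document in a tree data structure.

Dependencies

You need to install the Python 3 Tk toolkit (python3-tk):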

$ sudo apt install python3-tk

Installation

To install this package from PyPI:

$ pip install pdftotree

Usage

pdftotree as a Python package

import pdftotree

pdftotree.parse(pdf_file, html_path=None, model_type=None, model_path=None, favor_figures=True, visualize=False)
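As a minimal sketch of how the call above might be wrapped, the helper below converts one PDF and writes the result next to a chosen output directory. The helper name `convert_pdf` and the output naming scheme are hypothetical. It also assumes that passing `html_path` makes `parse` write the HTML to that file, which is inferred from the parameter name and the `-o` option of the command-line tool rather than stated in this description.

```python
from pathlib import Path


def convert_pdf(pdf_file, out_dir):
    """Run pdftotree on one PDF and return the path of the HTML output.

    Hypothetical helper: only the pdftotree.parse signature comes from the
    project description; everything else is illustrative.
    """
    # Imported lazily so this module can be loaded without pdftotree installed.
    import pdftotree

    html_path = Path(out_dir) / (Path(pdf_file).stem + ".html")
    pdftotree.parse(
        str(pdf_file),
        html_path=str(html_path),  # assumption: parse writes the HTML here
        model_type=None,           # None selects the heuristics approach
        model_path=None,
        favor_figures=True,
        visualize=False,
    )
    return html_path
```

Calling `convert_pdf("paper.pdf", "out")` would then be expected to produce `out/paper.html`, provided the `out` directory already exists.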

pdftotree

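This is the primary command-line utility provided by this Python package. It takes a PDF file as input and produces an HTML-like representation of the document: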

usage: pdftotree [options] pdf_file

Script to extract tree structure from PDF files. Takes a PDF as input and
outputs an HTML-like representation of the document's structure. By default,
this conversion is done using heuristics. However, a model can be provided as
a parameter to use a machine-learning-based approach.

positional arguments:
  pdf_file              PDF file name for which tree structure needs to be
                        extracted

optional arguments:
  -h, --help            show this help message and exit
  -mt {vision,ml,None}, --model_type {vision,ml,None}
                        Model type to use. None (default) for heuristics
                        approach.
  -m MODEL_PATH, --model_path MODEL_PATH
                        Pretrained model, generated by extract_tables tool
  -o OUTPUT, --output OUTPUT
                        Path where tree structure should be saved. If none,
                        HTML is printed to stdout.
  -f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
                        Whether figures must be favored over other parts such
                        as tables and section headers
  -V, --visualize       Whether to output visualization images for the tree
  -d, --dry-run         Run pdftotree, but do not save any output or print to
                        console.
  -v, --verbose         Output INFO level logging.
  -vv, --veryverbose    Output DEBUG level logging.
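When the tool is driven from a script, the options listed above can be assembled into an argument list. The sketch below does only that: the function name `pdftotree_command` is hypothetical, and it uses just the long option names shown in the help text. Whether a particular `--model_type` pairs with a particular model file is not specified here, so the caller has to decide.

```python
def pdftotree_command(pdf_file, output=None, model_type=None,
                      model_path=None, visualize=False):
    """Build an argv list for the pdftotree CLI (hypothetical helper)."""
    cmd = ["pdftotree"]
    if model_type is not None:
        cmd += ["--model_type", model_type]   # "vision" or "ml" per the help text
    if model_path is not None:
        cmd += ["--model_path", model_path]   # model generated by extract_tables
    if output is not None:
        cmd += ["--output", output]           # omit to print HTML to stdout
    if visualize:
        cmd.append("--visualize")
    cmd.append(pdf_file)                      # positional argument goes last
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once pdftotree is installed.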

extract_tables

This tool trains a machine-learning model to extract tables. The output model can be used as input to pdftotree:

usage: extract_tables [-h] [--mode MODE] --model-path MODEL_PATH
                      [--train-pdf TRAIN_PDF] --test-pdf TEST_PDF
                      [--gt-train GT_TRAIN] --gt-test GT_TEST --datapath
                      DATAPATH [--iou-thresh IOU_THRESH] [-v] [-vv]

Script to extract tables bounding boxes from PDF files using machine learning.
If `model.pkl` is saved in the model-path, the pickled model will be used for
prediction. Otherwise the model will be retrained. If --mode is test (by
default), the script will create a .bbox file containing the tables for the
pdf documents listed in the file --test-pdf. If --mode is dev, the script will
also extract ground truth labels for the test data and compute statistics.

optional arguments:
  -h, --help            show this help message and exit
  --mode MODE           Usage mode dev or test, default is test
  --model-path MODEL_PATH
                        Path to the model. If the file exists, it will be
                        used. Otherwise, a new model will be trained.
  --train-pdf TRAIN_PDF
                        List of pdf file names used for training. These files
                        must be saved in the --datapath directory. Required if
                        no pretrained model is provided.
  --test-pdf TEST_PDF   List of pdf file names used for testing. These files
                        must be saved in the --datapath directory.
  --gt-train GT_TRAIN   Ground truth train tables. Required if no pretrained
                        model is provided.
  --gt-test GT_TEST     Ground truth test tables.
  --datapath DATAPATH   Path to directory containing the input documents.
  --iou-thresh IOU_THRESH
                        Intersection over union threshold to remove duplicate
                        tables
  -v                    Output INFO level logging
  -vv                   Output DEBUG level logging
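The `--iou-thresh` option refers to intersection over union (IoU), the overlap area of two boxes divided by the area of their union. The sketch below illustrates the measure and one simple greedy way to drop near-duplicate boxes. It assumes boxes are given as (top, left, bottom, right), matching the order used in the ground truth labels, and it is not necessarily how extract_tables itself removes duplicates.

```python
def iou(box_a, box_b):
    """Intersection over union of two (top, left, bottom, right) boxes."""
    top_a, left_a, bottom_a, right_a = box_a
    top_b, left_b, bottom_b, right_b = box_b
    # Height and width of the overlap, clamped at zero when boxes are disjoint.
    inter_h = max(0, min(bottom_a, bottom_b) - max(top_a, top_b))
    inter_w = max(0, min(right_a, right_b) - max(left_a, left_b))
    inter = inter_h * inter_w
    area_a = (bottom_a - top_a) * (right_a - left_a)
    area_b = (bottom_b - top_b) * (right_b - left_b)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_duplicates(boxes, iou_thresh):
    """Keep a box only if it does not overlap an already kept box too much."""
    kept = []
    for box in boxes:
        if all(iou(box, other) <= iou_thresh for other in kept):
            kept.append(box)
    return kept
```

With this greedy scheme the first box of each overlapping group survives, so the input order matters.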

PDF list format

The PDF list is simply a file with one file name per line. For example:

1-s2.0-S000925411100369X-main.pdf
1-s2.0-S0009254115301030-main.pdf
1-s2.0-S0012821X12005717-main.pdf
1-s2.0-S0012821X15007487-main.pdf
1-s2.0-S0016699515000601-main.pdf

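Ground truth file format

The ground truth file is formatted to parallel the PDF list. That is, its first line holds the labels for the first document in the corresponding PDF list. Labels take the form of semicolon-separated tuples containing the values (page_num, page_width, page_height, top, left, bottom, right). For example: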

(10, 696, 951, 634, 366, 832, 653);(14, 696, 951, 720, 62, 819, 654);(4, 696, 951, 152, 66, 813, 654);(7, 696, 951, 415, 57, 833, 647);(8, 696, 951, 163, 370, 563, 652)
(11, 713, 951, 97, 47, 204, 676);(11, 713, 951, 261, 45, 357, 673);(3, 713, 951, 110, 44, 355, 676);(8, 713, 951, 763, 55, 903, 687)
(5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)
(5, 718, 951, 131, 382, 403, 677)
(13, 713, 951, 119, 56, 175, 364);(13, 713, 951, 844, 57, 902, 363);(14, 713, 951, 109, 365, 164, 671);(8, 713, 951, 663, 46, 890, 672)

One way to label these tables is to use DocumentAnnotation, which lets you select table regions in a web browser and produces the bounding-box file.
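The ground truth format described above is simple enough to read back with the standard library. The following is a minimal sketch of a parser for one line of such a file. The names `TableLabel` and `parse_ground_truth_line` are hypothetical and not part of pdftotree.

```python
import ast
from typing import List, NamedTuple


class TableLabel(NamedTuple):
    page_num: int
    page_width: int
    page_height: int
    top: int
    left: int
    bottom: int
    right: int


def parse_ground_truth_line(line: str) -> List[TableLabel]:
    """Parse one ground truth line into a list of table labels."""
    labels = []
    for chunk in line.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon or an empty line
        values = ast.literal_eval(chunk)  # safely evaluates the tuple literal
        if len(values) != 7:
            raise ValueError(f"expected 7 values per label, got {len(values)}")
        labels.append(TableLabel(*values))
    return labels
```

Each line of the ground truth file can be passed through this function and paired with the file name on the same line of the PDF list.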

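Example dataset: paleontological papers

The full set of documents and ground truth labels can be downloaded here: PaleoDocs. You can train a machine-learning model by downloading this dataset, extracting it into a directory named data, and then running the command below. Double-check that the paths in the command match the location of the downloaded data: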

$ extract_tables --train-pdf data/paleo/ml/train.pdf.list.paleo.not.scanned --gt-train data/paleo/ml/gt.train --test-pdf data/paleo/ml/test.pdf.list.paleo.not.scanned --gt-test data/paleo/ml/gt.test --datapath data/paleo/documents/ --model-path data/model.pkl

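The model produced by this example command is saved as data/model.pkl.

For developers

We follow Semantic Versioning 2.0.0 conventions. The maintainers will create a tag for each release and increment the version number found in the version file accordingly. We use Travis CI to deploy tags to PyPI automatically.

Tests

To test changes in the package, install it in editable mode locally in your virtualenv by running: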
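Before starting a training run, it can help to verify the inputs against the rules stated earlier: every listed PDF must exist under the --datapath directory, and the ground truth file must have one line per listed document. The check below is a sketch under those two assumptions only. The function name `check_dataset` is hypothetical, and the script does not validate the label contents.

```python
from pathlib import Path


def check_dataset(pdf_list, ground_truth, datapath):
    """Return a list of human-readable problems; an empty list means OK."""
    names = [line.strip() for line in Path(pdf_list).read_text().splitlines()]
    label_lines = Path(ground_truth).read_text().splitlines()
    problems = []
    # The ground truth file parallels the PDF list line by line.
    if len(names) != len(label_lines):
        problems.append(
            f"{pdf_list} lists {len(names)} documents but "
            f"{ground_truth} has {len(label_lines)} label lines"
        )
    # Every listed document must be present in the data directory.
    for name in names:
        if not (Path(datapath) / name).is_file():
            problems.append(f"missing document: {Path(datapath) / name}")
    return problems
```

For the example command, this would be called once with the --train-pdf and --gt-train paths and once with the --test-pdf and --gt-test paths, using data/paleo/documents/ as the data directory.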

$ make dev

This will also install all the tools we use to enforce code style.

Then you can run our tests:

$ make test

pdftotree 0.4.0
