pdfx Python package - PyPI

Extract metadata and URLs from PDF files, and download all referenced PDFs

Detailed description of the pdfx Python project

Introduction

Extract references (PDF, URL, DOI) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

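Features

Extract references and metadata from a given PDF
Detect PDF, URL, arXiv and DOI references
Fast, parallel download of all referenced PDFs
Check for broken links (using the -c flag)
Output as text or JSON (using the -j flag)
Extract the PDF text (using the --text flag)
Use as a command-line tool or as a Python package
Compatible with Python 2 and 3
Works with local and online PDFs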

Getting Started

Install pdfx with easy_install or pip, then run it:

$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit
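
The flags listed above can also be assembled from a script. The following is a minimal sketch, not part of pdfx itself: `build_pdfx_command` is a hypothetical helper, and `paper.pdf` and `refs.json` are placeholder file names. It only uses the options shown in the help output.

```python
import shlex


def build_pdfx_command(pdf, download_dir=None, check_links=False,
                       as_json=False, verbose=False, text_only=False,
                       output_file=None):
    """Return an argv list for the pdfx CLI (hypothetical helper).

    Only flags documented in `pdfx -h` are used.
    """
    cmd = ["pdfx", pdf]
    if download_dir is not None:
        cmd += ["-d", download_dir]   # download referenced PDFs
    if check_links:
        cmd.append("-c")              # check for broken links
    if as_json:
        cmd.append("-j")              # JSON instead of plain text
    if verbose:
        cmd.append("-v")              # all references, not only PDFs
    if text_only:
        cmd.append("-t")              # extract text only
    if output_file is not None:
        cmd += ["-o", output_file]    # write to a file instead of console
    return cmd


# Placeholder file names, for illustration only.
cmd = build_pdfx_command("paper.pdf", as_json=True, verbose=True,
                         output_file="refs.json")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once pdfx is installed.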

Examples

Let's look at this paper: https://weakdh.org/imperfect-forward-secrecy.pdf

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

References: 36
- URL: 18
- PDF: 18

PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf
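
A reference list like the one above is easy to post-process. As a rough sketch, assuming the plain-text output keeps the `- http...` line format shown here, the references can be counted per host. `count_hosts` is a hypothetical helper and the sample is a shortened copy of the output above.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(output_text):
    """Count hosts among lines of the form '- http...' (hypothetical helper)."""
    hosts = Counter()
    for line in output_text.splitlines():
        line = line.strip()
        # Metadata lines such as '- Pages = 13' do not start with '- http'.
        if line.startswith("- http"):
            url = line[2:].strip()
            hosts[urlparse(url).netloc] += 1
    return hosts


# Shortened sample taken from the output above.
sample = """PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35513.pdf
"""
print(count_hosts(sample).most_common())
```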

Use the -v flag to print all references instead of only the PDFs.

Use -d (short for --download-pdfs) to download all referenced PDFs into a specified directory (for example /tmp/):

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
...

To extract the text, use the -t flag:

# Extract text to console
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t

# Extract text to file
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt

To check for broken links, use the -c flag:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c

Example video of checking for broken links: http://recordit.co/PsigiMaooH
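
To illustrate what a broken-link check involves, here is a minimal sketch using only the Python 3 standard library. It is not pdfx's own implementation: `http_status` and `find_broken_links` are hypothetical names, and treating a status of 400 or above (or no response at all) as broken is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def find_broken_links(urls, fetch=http_status, max_workers=8):
    """Check urls in parallel and return (url, status) pairs that look broken."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch, urls))
    # Assumption: no response or any status >= 400 counts as broken.
    return [(url, status) for url, status in zip(urls, statuses)
            if status is None or status >= 400]
```

The `fetch` parameter exists so the checker can be exercised with a stub instead of real network calls. Some servers reject HEAD requests, so a more careful checker might fall back to GET.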

Usage as a Python library

>>> import pdfx
>>> pdf = pdfx.PDFx("filename-or-url.pdf")
>>> metadata = pdf.get_metadata()
>>> references_list = pdf.get_references()
>>> references_dict = pdf.get_references_as_dict()
>>> pdf.download_pdfs("target-directory")
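
Building on the calls above, the following sketch counts references per type. It assumes that `get_references_as_dict()` returns a dict mapping a reference type (such as "pdf" or "url") to a list of reference strings; that shape is an assumption here, so verify it against your installed version. `count_references` and `summarize_pdf` are hypothetical helpers.

```python
def count_references(references_dict):
    """Map each reference type to the number of references found.

    Assumes a dict of the form {"pdf": [...], "url": [...]}.
    """
    return {reftype: len(refs) for reftype, refs in references_dict.items()}


def summarize_pdf(path_or_url):
    """Open a PDF with pdfx and count its references per type."""
    # Imported lazily so count_references can be used without pdfx installed.
    import pdfx

    pdf = pdfx.PDFx(path_or_url)
    return count_references(pdf.get_references_as_dict())


# Hand-written sample data, not real pdfx output.
print(count_references({"pdf": ["a.pdf", "b.pdf"], "url": ["http://example.org"]}))
```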

Various

Author: Chris Hager <chris@linuxuser.at>
Homepage: https://www.metachris.com/pdfx
License: Apache

Feedback, ideas, and requests are welcome!

Maintainer: metachris
Version: pdfx 1.3.0