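Extract metadata and URLs from PDF files, and download all referenced PDFs
Detailed description of the pdfx Python project
Introduction
Extract references (PDF, URL, DOI) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.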
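Features
- Extract references and metadata from a given PDF
- Detect PDF, URL, arXiv and DOI references
- Fast, parallel download of all referenced PDFs
- Check for broken links (using the -c flag)
- Output as text or JSON (using the -j flag)
- Extract the PDF text (using the --text flag)
- Use as a command-line tool or as a Python package
- Compatible with Python 2 and 3
- Works with local and online PDFs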
Getting started
Grab a copy of the code with easy_install or pip, and run it:
$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>
Run pdfx -h to see the help output:
$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit
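When pdfx is driven from a script rather than typed by hand, the flags from the help output can be assembled programmatically. The following is a minimal sketch for Python 3 only: the helper names build_pdfx_command and run_pdfx and their keyword parameters are hypothetical and not part of pdfx. Only the flags themselves come from the help text.

```python
import subprocess


def build_pdfx_command(pdf, download_dir=None, check_links=False,
                       as_json=False, verbose=False, text_only=False,
                       output_file=None):
    """Build an argv list for the pdfx CLI (hypothetical helper, not part of pdfx)."""
    cmd = ["pdfx", pdf]
    if download_dir is not None:
        cmd += ["-d", download_dir]   # --download-pdfs
    if check_links:
        cmd.append("-c")              # --check-links
    if as_json:
        cmd.append("-j")              # --json
    if verbose:
        cmd.append("-v")              # --verbose
    if text_only:
        cmd.append("-t")              # --text
    if output_file is not None:
        cmd += ["-o", output_file]    # --output-file
    return cmd


def run_pdfx(pdf, **options):
    """Run pdfx and return its standard output as text (needs pdfx on PATH)."""
    result = subprocess.run(build_pdfx_command(pdf, **options),
                            stdout=subprocess.PIPE, check=True,
                            universal_newlines=True)
    return result.stdout
```

Keeping the argument list separate from the subprocess call makes the command easy to inspect or log before anything is executed.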
Examples
Let's look at this paper, https://weakdh.org/imperfect-forward-secrecy.pdf:
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

References: 36
- URL: 18
- PDF: 18

PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf
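The CreationDate and ModDate values in this output use the PDF date notation (D:YYYYMMDDHHmmSS followed by a UTC offset), while the xap block shows the same instants in ISO 8601 form. As a rough illustration of how the two relate, the sketch below converts the full-length PDF form into a datetime. The helper parse_pdf_date is hypothetical and not part of pdfx, it targets Python 3, and it deliberately ignores the shorter date variants the PDF format also allows.

```python
import re
from datetime import datetime, timedelta, timezone

# Full form only: D:YYYYMMDDHHmmSS plus an optional offset such as -04'00' or Z.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Convert a full-length PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % (value,))
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    utc_flag, sign, off_hours, off_minutes = match.groups()[6:]
    tzinfo = None  # stays naive when the string carries no offset
    if utc_flag:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tzinfo = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)


print(parse_pdf_date("D:20150821110623-04'00'").isoformat())
# prints 2015-08-21T11:06:23-04:00, matching xap CreateDate above
```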
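You can use the -v flag to output all references instead of only the PDFs.
Use -d (for download-pdfs) to download all referenced PDFs into the specified directory (e.g. /tmp/):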
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/ ...
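To give an idea of what a parallel download step can look like, here is a minimal sketch built on a thread pool from the Python 3 standard library. It is not pdfx's actual implementation: the names fetch_bytes and download_all, the max_workers default and the file-naming rule are all assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Default fetcher: return the body of url (performs a real HTTP request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, target_dir, fetch=fetch_bytes, max_workers=8):
    """Download urls into target_dir in parallel; return {url: path or exception}."""
    os.makedirs(target_dir, exist_ok=True)

    def download_one(url):
        # Name the file after the last path component. Two URLs that share a
        # file name would overwrite each other in this simplified version.
        name = os.path.basename(urlparse(url).path) or "download.pdf"
        path = os.path.join(target_dir, name)
        try:
            data = fetch(url)
            with open(path, "wb") as handle:
                handle.write(data)
            return url, path
        except Exception as error:  # one failed download should not stop the rest
            return url, error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(download_one, urls))
```

Threads suit this task because the work is dominated by waiting on the network. The fetch parameter is injectable so the function can be exercised without network access.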
To extract the text, use the -t flag:
# Extract text to console
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t
# Extract text to file
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt
To check for broken links, use the -c flag:
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c
Example video of checking for broken links: http://recordit.co/PsigiMaooH
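For readers curious about what a broken-link check involves, the sketch below issues a HEAD request per URL and reports everything that does not answer with a 2xx or 3xx status. This is an illustration under stated assumptions, not how pdfx implements -c: the function names http_status and find_broken_links, the use of HEAD and the status threshold are all choices made here, and the code targets Python 3.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url using a HEAD request."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx and 5xx responses raise HTTPError; the code is still the answer.
        return error.code


def find_broken_links(urls, get_status=http_status):
    """Return {url: reason} for every URL that does not answer with 2xx/3xx."""
    broken = {}
    for url in urls:
        try:
            status = get_status(url)
        except (URLError, OSError, ValueError) as error:
            # DNS failures, refused connections, timeouts, malformed URLs.
            broken[url] = str(error)
            continue
        if not 200 <= status < 400:
            broken[url] = "HTTP %d" % status
    return broken
```

Some servers reject HEAD requests, so a more forgiving checker might retry with GET before declaring a link broken.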
Usage as a Python library
>>> import pdfx
>>> pdf = pdfx.PDFx("filename-or-url.pdf")
>>> metadata = pdf.get_metadata()
>>> references_list = pdf.get_references()
>>> references_dict = pdf.get_references_as_dict()
>>> pdf.download_pdfs("target-directory")
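The dictionary returned by get_references_as_dict() can be post-processed like any other Python mapping. The sketch below assumes it maps a reference type (such as "pdf" or "url") to a list of reference strings. That shape is an assumption, so check it against the installed version. The helper summarize_references is hypothetical, and the sample data is a hand-written stand-in so the snippet runs without pdfx installed.

```python
def summarize_references(references_by_type):
    """Count references per type and overall (hypothetical helper, not part of pdfx)."""
    counts = {ref_type: len(refs) for ref_type, refs in references_by_type.items()}
    counts["total"] = sum(counts.values())
    return counts


# Stand-in for pdf.get_references_as_dict(); the shape is assumed, and the
# "url" entry is a placeholder rather than a real reference from the paper.
sample = {
    "pdf": [
        "http://www.spiegel.de/media/media-35533.pdf",
        "http://cryptome.org/2013/08/spy-budget-fy13.pdf",
    ],
    "url": ["http://example.com/"],
}
print(summarize_references(sample))
```

With a real document, the result of pdf.get_references_as_dict() would be passed in place of sample.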