使用Python或其他方法从PDF提取指向另一个PDF页面的链接

6 投票
3 回答
6854 浏览
提问于 2025-04-16 17:29

我有5个PDF文件,每个文件里都有指向另一个PDF文件不同页面的链接。这些文件都是大型PDF的目录(每个大约有1000页),手动提取这些链接虽然可以,但实在是太麻烦了。目前我尝试过在Acrobat Pro中打开文件,右键点击每个链接可以看到它指向的页面,但我需要以某种方式提取所有的链接。我不介意需要进一步处理这些链接,但我就是找不到方法把它们提取出来。我试过从Acrobat Pro导出PDF为HTML或Word格式,但这两种方法都没有保留链接。

我现在快要绝望了,任何帮助都非常感谢。我对使用Python或者其他一些语言都很熟悉。

3 个回答

0

如果你不能使用Python,但有其他方法可以解压PDF内部的对象流,比如用qpdf,你可以用grep来查找URI:

qpdf --qdf --object-streams=disable input.pdf - | grep -Poa '(?<=/URI \().*(?=\))'
2

我刚刚做了一个小工具,用Python写的,专门用来列出和下载某个PDF文件中提到的所有PDF文件。你可以在这里找到它:https://www.metachris.com/pdfx/ (还有:https://github.com/metachris/pdfx

$ ./pdfx.py https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Reading url 'https://weakdh.org/imperfect-forward-secrecy.pdf'...
Saved pdf as './imperfect-forward-secrecy.pdf'
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- Pages = 13

Analyzing text...
- URLs: 49
- URLs to PDFs: 17

JSON summary saved as './imperfect-forward-secrecy.pdf.infos.json'

Downloading 17 referenced pdfs...
Created directory './imperfect-forward-secrecy.pdf-referenced-pdfs'
Downloaded 'http://cr.yp.to/factorization/smoothparts-20040510.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/smoothparts-20040510.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35517.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35517.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35514.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35514.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35519.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35519.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35522.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35522.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35509.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35509.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35528.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35528.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35513.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35513.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35533.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35533.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35551.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35551.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35527.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35527.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35520.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35520.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35526.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35526.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35515.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35515.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35529.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35529.pdf'...
Downloaded 'http://cryptome.org/2013/08/spy-budget-fy13.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/spy-budget-fy13.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35671.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35671.pdf'...

这个工具使用了PyPDF2,这是一个在Python中读取PDF内容的标准库。它还用到了正则表达式来匹配所有的网址。如果你在运行时加上-d选项(也就是--download-pdfs),它会为每个PDF启动一个下载线程。

5

我在用 pyPdf 查找一些URI(统一资源标识符),

import pyPdf

f = open('TMR-Issue6.pdf','rb')

pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for pg in range(pgs):

    p = pdf.getPage(pg)
    o = p.getObject()

    if o.has_key(key):
        ann = o[key]
        for a in ann:
            u = a.getObject()
            if u[ank].has_key(uri):
                print u[ank][uri]

结果是,

http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html

etc...

我没有找到一个包含链接到其他PDF文件的文件,但我怀疑URI字段应该包含像 file:///myfiles 这样的URI格式。

撰写回答