Reading a remote PDF with urllib2

Published 2024-06-16 09:48:43


I am trying to extract text from a remote PDF.

The URL is http://loc.gov/aba/publications/FreeLCC/A-text.pdf

My code is as follows:

import urllib2
import PyPDF2
import io

URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
remote_file = urllib2.urlopen(URL).read()
memory_file = io.BytesIO(remote_file)

read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()

for i in range(0, number_of_pages):
    pageObj = read_pdf.getPage(i)
    page = pageObj.extractText()
    print (page)

I get an HTTP 403 error. What am I doing wrong?


1 answer
User
#1 · Posted 2024-06-16 09:48:43

Source

import urllib2
import PyPDF2
import io

URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
# Send an explicit User-Agent header; the 403 appears to be triggered by urllib2's default one.
req = urllib2.Request(URL, headers={'User-Agent': "Magic Browser"})
remote_file = urllib2.urlopen(req).read()
memory_file = io.BytesIO(remote_file)

read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()

for i in range(0, number_of_pages):
    pageObj = read_pdf.getPage(i)
    page = pageObj.extractText()
    print (page)
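
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is a minimal sketch, not part of the original answer: the helper names build_request and fetch_to_memory, the DEFAULT_USER_AGENT constant, and the 30-second timeout are all assumptions made for illustration. Only the "Magic Browser" header value and the URL come from the answer above.

```python
import io
import urllib.request

# Header value taken from the answer above; any browser-like string may work.
DEFAULT_USER_AGENT = "Magic Browser"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_to_memory(url, timeout=30):
    """Download the URL and wrap the bytes in a seekable in-memory file (hypothetical helper)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return io.BytesIO(response.read())


# Usage (needs network access and a PDF library, so it is left commented out):
# memory_file = fetch_to_memory("http://loc.gov/aba/publications/FreeLCC/A-outline.pdf")
# reader = PyPDF2.PdfReader(memory_file)   # naming used by newer PyPDF2 releases
# for page in reader.pages:
#     print(page.extract_text())
```

Keeping the request construction in its own function makes it possible to check the header without touching the network, and the BytesIO wrapper gives the PDF reader the seekable file object it expects.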
