How to scrape PDF links from a website with a Python script

数据工程师

Asked 2025-04-16 18:52

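I often need to download PDF files from websites, but the files are not always on the same page. The site splits the links across several pages, so I have to click through every page one by one to find them.

I'm learning Python and want to write a script that takes a website's URL and extracts the PDF links from that site.

I'm very new to Python. Can anyone tell me how to do this?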

Tags: automation-scripts, web-scraping, data-collection, link-extraction, pdf-download

3 Answers

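I'm writing this on my phone, so it may be hard to read.

If you want to pull information out of static web pages or similar content, fetching the HTML with requests is simple.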

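import requests
url = 'http://example.com/'  # placeholder; replace with the page you want to scrape
page_content = requests.get(url)
html = page_content.text  # the page's HTML as a string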
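
Once the HTML is in hand, the PDF links still have to be picked out of it. Below is a minimal sketch of that step using only the standard library (html.parser and urllib.parse). The names PdfLinkParser and extract_pdf_links are made up for illustration, and the sketch assumes the PDFs are linked through ordinary <a href="..."> tags:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page they came from.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so "report.pdf?download=1" still counts.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def extract_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

Passing the text of the response fetched above (its .text attribute) together with the page's URL should give back a list of absolute PDF URLs.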

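But if you want to scrape something like a social networking site, you will run into anti-scraping measures, and the problem becomes how to get past them.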

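First approach: make your request look more like it comes from a browser (a human). Add request headers (you can copy them from Chrome's developer tools or Fiddler). Make sure your form submissions use the same method and fields as the browser's. Get the cookies and add them to your requests.
Second approach: use Selenium with a browser driver. Selenium drives a real browser through its driver (I use chromedriver). Remember to put chromedriver on your PATH, or point to the executable in code: driver = webdriver.Chrome(service=Service(path)), with webdriver imported from selenium and Service from selenium.webdriver.chrome.service. I'm not certain this is the exact setup code, so check it against your Selenium version.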
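
To illustrate the first approach before continuing with Selenium, here is a rough sketch of sending browser-like headers and keeping cookies between requests. It uses the standard library (urllib.request and http.cookiejar) rather than requests so that it runs without extra installs. The header values and the names BROWSER_HEADERS, build_opener_with_cookies, and build_request are assumptions for illustration; copy the real headers from your own browser:

```python
import http.cookiejar
import urllib.request

# Hypothetical header values; replace them with the ones your browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_opener_with_cookies():
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def build_request(url):
    """Create a request for `url` that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)
```

A page would then be fetched with opener.open(build_request(url)).read(), and any cookies the site sets are sent back automatically on later requests made through the same opener.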

Then call driver.get(url) to open the URL. This goes through an actual browser, so scraping the content is somewhat easier.

Get the page content: page = driver.page_source

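Some sites redirect several times, which can cause errors. In that case, make the program wait until a specific element appears.

Try an explicit wait (WebDriverWait comes from selenium.webdriver.support.ui, EC is selenium.webdriver.support.expected_conditions, and By comes from selenium.webdriver.common.by; the 10 is a timeout in seconds that you can change):

certain_element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'the-id-of-an-element-you-know')))

Or use an implicit wait, which makes every element lookup wait up to the time you set:

driver.implicitly_wait(5)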

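You can also control the site through WebDriver (clicking, typing, and so on). I won't go into detail here; you can look up the relevant modules yourself.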
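
Tying the Selenium steps back to the question, the sketch below opens a page in an existing driver and keeps the links that point at PDFs. The function name pdf_links_from_driver is hypothetical, and the code assumes the links are plain <a href> elements that are present once the page has loaded:

```python
from urllib.parse import urlparse

# Same string as selenium.webdriver.common.by.By.CSS_SELECTOR, spelled out
# so that this helper itself does not need to import Selenium.
CSS_SELECTOR = "css selector"


def pdf_links_from_driver(driver, url):
    """Open `url` in an already-created WebDriver and return its PDF links."""
    driver.get(url)
    links = []
    for anchor in driver.find_elements(CSS_SELECTOR, "a[href]"):
        # For href, Selenium normally returns the browser-resolved absolute URL.
        href = anchor.get_attribute("href")
        if href and urlparse(href).path.lower().endswith(".pdf"):
            links.append(href)
    return links
```

It would be called with the driver created earlier, for example pdf_links_from_driver(driver, url), followed by driver.quit() when you are done with the browser.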

Answered 2025-04-16 by Python大师

If the links are spread across many pages, try an excellent tool, Scrapy (http://scrapy.org/). It is easy to get started with and can download the PDF files you need.

Answered 2025-04-16 by Python大师

Doing this with urllib.request, urllib.parse, and lxml is pretty simple. I've commented the code in more detail than usual because you're new to Python:

# modules we're using (you'll need to install lxml, e.g. with pip)
import lxml.html
import urllib.request
import urllib.parse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib.request.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
# (raw string, so the backslash reaches the regular expression unchanged)
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print(urllib.parse.urljoin(base_url, node.attrib['href']))

Output:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
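
The script above covers a single page, while the question mentions links spread over several pages. One possible way to extend the idea is a small breadth-first crawl that follows links on the same host and collects every PDF URL it sees. This is only a sketch: it uses the standard library's html.parser instead of lxml, the names HrefParser and crawl_for_pdfs and the max_pages limit are made up for illustration, and the page-fetching function is passed in by the caller:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class HrefParser(HTMLParser):
    """Collect every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def crawl_for_pdfs(start_url, fetch, max_pages=50):
    """Breadth-first crawl of pages on start_url's host, returning PDF URLs.

    `fetch` is any callable that takes a URL and returns that page's HTML text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pdf_links = []
    pages_fetched = 0
    while queue and pages_fetched < max_pages:
        page_url = queue.popleft()
        parser = HrefParser()
        parser.feed(fetch(page_url))
        pages_fetched += 1
        for href in parser.hrefs:
            # Make the link absolute and drop any "#fragment" part.
            link, _fragment = urldefrag(urljoin(page_url, href))
            if link in seen:
                continue
            seen.add(link)
            parts = urlparse(link)
            if parts.path.lower().endswith(".pdf"):
                pdf_links.append(link)
            elif parts.netloc == host and parts.scheme in ("http", "https"):
                # Only follow ordinary pages on the same site.
                queue.append(link)
    return pdf_links
```

With urllib as used above, fetch could be something like lambda u: urllib.request.urlopen(u).read().decode('utf-8', 'replace'). The max_pages cap is there so a large site does not turn into an unbounded crawl.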

Answered 2025-04-16 by Python大师
