Using Scrapy to find and download PDF files from a website

import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

1 answer

User

#1 · Posted 2024-05-28 19:02:08

The spider's logic appears to be incorrect.

I took a quick look at your website, and there seem to be several types of pages:

1. The initial page: http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html
2. Pages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which can be reached from page type 1
3. The actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which can be reached from page type 2

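So the correct logic is: first fetch page type 1, then fetch the type 2 pages it links to, and only then download the type 3 PDFs.
Your spider, however, tries to extract links to type 3 directly from type 1, where no such links exist.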
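
To illustrate why filtering the initial page for `.pdf` links finds nothing, here is a minimal standard-library sketch (it does not use Scrapy). The HTML snippets are hypothetical, simplified stand-ins for the real pages, and `HrefCollector`, `extract_links`, and `pdf_links` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return every link in the HTML as an absolute URL."""
    collector = HrefCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def pdf_links(html, base_url):
    """Return only the absolute links that end in .pdf."""
    return [u for u in extract_links(html, base_url) if u.lower().endswith(".pdf")]


# Assumption: simplified markup; the real pages are far more complex.
INDEX_URL = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
INDEX_HTML = (
    '<h3><a href="/us/en/tax-services/publications/insights/'
    'australia-introduces-new-foreign-resident-cgt-withholding-regime.html">'
    "Article</a></h3>"
)
ARTICLE_HTML = (
    '<a href="/us/en/state-local-tax/newsletters/salt-insights/assets/'
    'pwc-wotc-precertification-period-extended-to-june-29.pdf">Download</a>'
)

# Step 1: the initial page only links to articles, so the PDF filter is empty.
print(pdf_links(INDEX_HTML, INDEX_URL))

# Step 2: the article pages are where the PDF links actually live.
article_urls = extract_links(INDEX_HTML, INDEX_URL)
for article_url in article_urls:
    print(pdf_links(ARTICLE_HTML, article_url))
```

The same two-hop structure is what a spider needs: one callback for the initial page that yields article requests, and a second callback for article pages that yields PDF requests.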

Edit:

I have updated your code; here is a version that actually works:

import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page type 1: follow each article link listed on the initial page.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page type 2: follow each PDF link in the article's download section.
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page type 3: save the PDF under the last segment of its URL.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
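
One possible refinement, not part of the code above: `response.url.split('/')[-1]` keeps any query string or fragment in the file name and returns an empty string for URLs ending in `/`. The sketch below derives the name from the URL path only. The helper name `pdf_filename` and the fallback name `download.pdf` are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def pdf_filename(url, default="download.pdf"):
    """Derive a local file name from the path component of a URL."""
    path = urlsplit(url).path                  # ignores ?query and #fragment
    name = posixpath.basename(unquote(path))   # last path segment, percent-decoded
    # Fall back when the URL has no usable last segment.
    if name in ("", ".", ".."):
        return default
    return name
```

Inside the spider, `save_pdf` could then call `pdf_filename(response.url)` in place of the `split('/')` expression.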
