How do I scrape multiple pages on the web with Scrapy?


I want to collect the titles and abstracts of some articles. The website's pages are laid out like this:

Page 1 (list of conferences):
  Conf1, year
  Conf2, year
  ....

Page 2 (list of articles for each Conf):
  Article1, title
  Article2, title
  ....

Page 3 (the page for each Article):
  Title
  Abstract

I want to collect the articles for each conference (plus some other information about the conference, such as its year). First, I am not sure whether I need a framework like Scrapy for this, or whether a plain Python program would do. Looking at Scrapy, I can write a spider like the following to collect the conferences:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # Conference links from the first table on the page
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[1]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }

        # Conference links from the second table
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[2]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }

        # Follow pagination if a next page exists
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

However, I then have to follow each conference's link to get its articles, and I haven't found many examples of how to collect the rest of the data with Scrapy. Can you guide me on how to crawl the article pages while collecting the data for each conference?


1 Answer

You can write the code like this:

import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # A single loop over all tables covers both conference lists
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table/tbody/tr/th/a'):
            link = response.urljoin(conf.xpath('./@href').extract_first())
            item = {'name': conf.xpath('./text()').extract_first(),
                    'link': link}

            # Follow each conference link; pass the partial item along in meta
            yield scrapy.Request(link, callback=self.parse_listing,
                                 meta={'item': item})

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_listing(self, response):
        """
        Parse the article links on a conference page here.
        :param response:
        :return:
        """

        # Fetch the article urls here  ==> listing_urls
        # for url in listing_urls:
        #     yield scrapy.Request(url, callback=self.parse_details,
        #                          meta=response.meta)

    def parse_details(self, response):
        """
        Parse the article details (title, abstract) here.
        :param response:
        :return:
        """

        # Fetch the article details here  ==> details
        # yield details
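
For completeness, here is one way the two stubs might be filled in. The XPath expressions below are placeholders chosen for illustration (you need to inspect the real ACL Anthology markup and adjust the selectors); the important part is that response.meta carries the conference dict through both hops, so the final item combines the conference info with each article's title and abstract:

    def parse_listing(self, response):
        # Conference data forwarded from parse() via meta
        item = response.meta['item']

        # ASSUMPTION: placeholder XPath -- replace with the real
        # selector for article links on a conference page
        for href in response.xpath('//p[@class="paper"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_details,
                                 meta={'item': dict(item)})  # copy, one dict per article

    def parse_details(self, response):
        item = response.meta['item']
        # ASSUMPTION: placeholder selectors for the article page
        item['title'] = response.xpath('//h2//text()').extract_first()
        item['abstract'] = response.xpath('//div[@class="abstract"]//text()').extract_first()
        yield item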

You can also export the scraped items to a file, for example:

scrapy crawl toscrape-xpath -o output.csv
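
Scrapy infers the export format from the file extension, so pointing -o at a .json file gives you JSON output instead of CSV:

scrapy crawl toscrape-xpath -o output.json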
