How do I scrape multiple pages on the web with Scrapy?


I want to collect the titles and abstracts of some articles. The website's pages are laid out like this:

Page 1 (list of conferences):
  Conf1, year
  Conf2, year
  ....

Page 2 (list of articles for each Conf):
  Article1, title
  Article2, title
  ....

Page 3 (the page for each Article):
  Title
  Abstract

I want to collect the articles for each conference (plus some other information about the conference, such as its year). First, I am not sure whether I need a framework like Scrapy for this, or whether a plain Python program would do. Looking at Scrapy, I can write a spider like the following to collect the conferences:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # Conference links from the first table on the page
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[1]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }

        # Conference links from the second table
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[2]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }

        # Follow pagination if a next page exists
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

However, I then have to follow each conference's link to get its articles, and I haven't found many examples of how to collect the rest of the data with Scrapy. Can you guide me on how to crawl the article pages while collecting the data for each conference?


1 Answer

You can write the code like this:

import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # A single loop over all tables covers both conference lists
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table/tbody/tr/th/a'):
            link = response.urljoin(conf.xpath('./@href').extract_first())
            item = {'name': conf.xpath('./text()').extract_first(),
                    'link': link}

            # Follow each conference link; pass the partial item along in meta
            yield scrapy.Request(link, callback=self.parse_listing,
                                 meta={'item': item})

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_listing(self, response):
        """
        Parse the article links on a conference page here.
        :param response:
        :return:
        """

        # Fetch the article urls here  ==> listing_urls
        # for url in listing_urls:
        #     yield scrapy.Request(url, callback=self.parse_details,
        #                          meta=response.meta)

    def parse_details(self, response):
        """
        Parse the article details (title, abstract) here.
        :param response:
        :return:
        """

        # Fetch the article details here  ==> details
        # yield details
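
For completeness, here is one way the two stubs might be filled in. The XPath expressions below are placeholders chosen for illustration (you need to inspect the real ACL Anthology markup and adjust the selectors); the important part is that response.meta carries the conference dict through both hops, so the final item combines the conference info with each article's title and abstract:

    def parse_listing(self, response):
        # Conference data forwarded from parse() via meta
        item = response.meta['item']

        # ASSUMPTION: placeholder XPath -- replace with the real
        # selector for article links on a conference page
        for href in response.xpath('//p[@class="paper"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_details,
                                 meta={'item': dict(item)})  # copy, one dict per article

    def parse_details(self, response):
        item = response.meta['item']
        # ASSUMPTION: placeholder selectors for the article page
        item['title'] = response.xpath('//h2//text()').extract_first()
        item['abstract'] = response.xpath('//div[@class="abstract"]//text()').extract_first()
        yield item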

You can also export the scraped items to a file, for example:

scrapy crawl toscrape-xpath -o output.csv
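
Scrapy infers the export format from the file extension, so pointing -o at a .json file gives you JSON output instead of CSV:

scrapy crawl toscrape-xpath -o output.json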
