Scraping multiple pages with Scrapy

5 votes
1 answer
3860 views
Asked 2025-04-18 07:44

I am trying to use Scrapy to scrape a website that has many pages of information.

My code is:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

I want to scrape all of the pages until I reach the last page... sometimes there are more pages than at other times, so it is hard to know exactly where the page numbers end.

1 Answer

11

The idea is to keep incrementing pageNumber until no titles are found on the page. If a page has no titles, a CloseSpider exception is raised, which stops the spider.

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from tcgplayer1.items import Tcgplayer1Item


URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"

class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        print self.page_number
        print "----------"

        sel = Selector(response)
        titles = sel.xpath("//div[@class='magicCard']")
        if not titles:
            raise CloseSpider('No more pages')

        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)

This particular spider would go through all 8 pages of the data and then stop.
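
The scraped items can then be exported in the usual way, e.g. by running scrapy crawl tcg -o cards.json from the project directory.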

Hope that helps.
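
For what it's worth, on current Scrapy releases (where BaseSpider has been removed in favour of scrapy.Spider and selectors are available directly on the response), a minimal sketch of the same stop-on-empty-page approach might look like this; the item fields are trimmed to the card name and yielded as a plain dict purely to keep the example short:

import scrapy
from scrapy.exceptions import CloseSpider

URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"

class TcgSpider(scrapy.Spider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]
    page_number = 1

    def parse(self, response):
        titles = response.xpath("//div[@class='magicCard']")
        if not titles:
            # an empty page means we have gone past the last page
            raise CloseSpider("No more pages")

        for title in titles:
            # a plain dict instead of Tcgplayer1Item, just to keep the sketch self-contained
            yield {"cardname": title.xpath(".//li[@class='cardName']/a/text()").get()}

        # schedule the next page; parse() is the default callback
        self.page_number += 1
        yield scrapy.Request(URL % self.page_number)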
