Scraping pagination links with Scrapy using start_urls does not work


I'm trying to scrape a website that has pagination links, so I did this:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1,20)]

And it worked!! For a single URL it works, but when I try this:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s',
                  'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1,20)]

it does not work. How can I apply the same logic to multiple URLs? Thanks.


1 Answer

One approach is to use the start_requests() method of scrapy.Spider instead of the start_urls attribute; see the Scrapy documentation for more details.

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.com']

    def start_requests(self):
        for page in range(1,20):
            yield scrapy.Request(
                url='https://www.dummymart.net/product/auto-parts--118?page%s' % page,
                callback=self.parse,
            )
            yield scrapy.Request(
                url='https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page,
                callback=self.parse,
            )

If you want to keep using the start_urls attribute, you can try something like this (I haven't tested it):

start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1,20)] + \
             ['https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1,20)]
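
If more URL patterns get added later, the same idea scales a bit better as a nested comprehension over a list of URL templates. This is only an untested sketch using the placeholder dummymart URLs from the question, with allowed_domains set to the domain the URLs actually point to:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.net']

    # One format string per listing; %s is replaced by the page number.
    url_templates = [
        'https://www.dummymart.net/product/auto-parts--118?page%s',
        'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s',
    ]

    # Every (template, page) combination: pages 1-19 for each template.
    start_urls = [template % page
                  for template in url_templates
                  for page in range(1, 20)]

    def parse(self, response):
        # Extraction logic goes here.
        pass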

Also note that in the allowed_domains attribute you only need to specify the domain, not a full URL.
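
To make that concrete, a small sketch based on the spiders above:

# As written in the question: a URL with a path, which is not what allowed_domains expects
allowed_domains = ['www.dummymart.com/product']

# Only the bare domain is needed; subdomains such as www. are covered automatically
allowed_domains = ['dummymart.com']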
