In scrapy+selenium, how do I make spider requests wait until the previous request has finished processing?

Posted 2024-03-29 13:34:00


TL;DR

In scrapy, I want each request to wait until all the spider's parse callbacks have finished, so that the whole process runs sequentially. Like this:

Request1 -> Crawl1 -> Request2 -> Crawl2 ...

But what happens now is:

Request1 -> Request2 -> Request3 ...
            Crawl1      
                        Crawl2
                                 Crawl3 ...

Long version

I am new to web scraping with scrapy+selenium. I am trying to scrape a website whose content is heavily updated with javascript. First I open the website with selenium and log in. After that, I use a downloader middleware that handles requests with selenium and returns the responses. Below is the middleware's process_request implementation:

from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class XYZDownloaderMiddleware:
    '''Other functions are as is. I just changed this one'''
    def process_request(self, request, spider):
        driver = request.meta['driver']

        # We are opening a new link.
        if request.meta['load_url']:
            driver.get(request.url)
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))
        # We are clicking on an element to get new data using javascript.
        elif request.meta['click_bet']:
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        # Returning an HtmlResponse here skips the normal download and hands
        # selenium's rendered page source straight back to the spider.
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)

In the settings I also set CONCURRENT_REQUESTS = 1, so that multiple driver.get() calls are not made at the same time and selenium can load responses peacefully, one by one.
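For reference, the relevant settings look roughly like this (a minimal sketch; the project module path and the middleware priority are assumptions, not from my actual project):

# settings.py (sketch; 'myproject' and the priority 543 are placeholders)
CONCURRENT_REQUESTS = 1

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.XYZDownloaderMiddleware': 543,
}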

What I see now is that selenium opens each URL, scrapy waits for selenium to finish loading the response, and then the middleware correctly returns the response (going into the if request.meta['load_url'] block).

However, after getting the response, I want to use the selenium driver (inside the parse(response) function) to click on each element, by yielding a request and getting the updated HTML back from the middleware (via the elif request.meta['click_bet'] block).
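A sketch of what such a "click" request could look like when yielded from a callback (the element XPath, callback name, and dont_filter choice are illustrative assumptions):

# Hypothetical sketch of yielding a "click" request from a callback.
def parseAfterLoad(self, response):
    for element in self.driver.find_elements(By.XPATH, '//button[@class="bet"]'):
        request = scrapy.Request(url=response.url, callback=self.parseAfterClick, dont_filter=True)
        request.meta['driver'] = self.driver
        request.meta['load_url'] = False        # do not call driver.get() again
        request.meta['click_bet'] = element     # the middleware clicks this element
        request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
        yield request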

The spider looks at least like this:

import scrapy


class XYZSpider(scrapy.Spider):
    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b'
        ]
        self.driver = self.getSeleniumDriver()
        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()
        for url in urls:
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...

So what happens now is that scrapy does not wait for the spider to finish parsing. Scrapy gets the response, then calls the parse callback and fetches the next response in parallel. But the parse callback needs to use the selenium driver before the next request is processed.

I want each request to wait until the parse callback has finished.
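One pattern that gives the sequential flow from the TL;DR is to chain requests, yielding the next request only at the end of the previous callback instead of yielding everything up front. A minimal sketch, assuming the start URLs can be consumed one at a time (the helper name next_request is made up for illustration):

# Sketch: crawl start URLs strictly one after another by yielding the
# next request from the callback instead of all at once in start_requests().
def start_requests(self):
    self.driver = self.getSeleniumDriver()
    self.pending_urls = ['https://www.example.com/a', 'https://www.example.com/b']
    yield self.next_request()

def next_request(self):
    url = self.pending_urls.pop(0)
    request = scrapy.Request(url=url, callback=self.parse, dont_filter=True)
    request.meta['driver'] = self.driver
    request.meta['load_url'] = True
    request.meta['wait_for_xpath'] = '/div/bla/bla'
    request.meta['click_bet'] = None
    return request

def parse(self, response):
    # ... process the current page with the selenium driver ...
    if self.pending_urls:
        yield self.next_request()  # only now is the next URL requested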

