In Scrapy, I want each request to wait until the spider's parse callback has finished before the next request is processed, so the whole flow is strictly sequential, like this:

Request1 -> Crawl1 -> Request2 -> Crawl2 ...

But what actually happens is:

Request1 -> Request2 -> Request3 ...
Crawl1
Crawl2
Crawl3 ...
I am new to scraping with Scrapy + Selenium. I am trying to scrape a website whose content is heavily updated via JavaScript. First, I open the site with Selenium and log in. After that, I use a downloader middleware that handles requests with Selenium and returns the response. Below is the middleware's process_request implementation:
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class XYZDownloaderMiddleware:
    '''Other functions are as-is. I just changed this one.'''

    def process_request(self, request, spider):
        driver = request.meta['driver']
        if request.meta['load_url']:
            # We are opening a new link.
            driver.get(request.url)
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))
        elif request.meta['click_bet']:
            # We are clicking on an element to get new data via JavaScript.
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)
In the settings I also set CONCURRENT_REQUESTS = 1, so that driver.get() is never called for several URLs at once and Selenium can load each response one at a time in peace.
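For reference, the relevant pieces of settings.py would look roughly like the sketch below; the module path myproject.middlewares is my assumption, adjust it to your project layout:

# settings.py (sketch; module path is assumed)
CONCURRENT_REQUESTS = 1  # one download at a time, as described above

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.XYZDownloaderMiddleware': 543,
}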
What I see now is that Selenium opens each URL, Scrapy lets Selenium wait for the response to finish loading, and the middleware then returns the response correctly (via the if request.meta['load_url'] branch).

After getting a response, however, I want to use the Selenium driver inside the parse(response) callback to click each element, by yielding a new request and having the middleware return the updated HTML (via the elif request.meta['click_bet'] branch).
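A request yielded from parse() for that click path might look like the sketch below. The button XPath, the wait XPath, and the parseUpdated callback name are my own placeholders, not from my actual code; dont_filter=True is needed because the URL does not change, so Scrapy's duplicate filter would otherwise drop the request:

from selenium.webdriver.common.by import By

    def parse(self, response):
        # Sketch: locate clickable elements with the live driver (placeholder XPath).
        for element in self.driver.find_elements(By.XPATH, '//button[contains(@class, "bet")]'):
            request = scrapy.Request(url=response.url, callback=self.parseUpdated, dont_filter=True)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = False     # skip the driver.get() branch
            request.meta['click_bet'] = element  # middleware clicks this element
            request.meta['wait_for_xpath'] = '//div[contains(@class, "updated")]'
            yield request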
The spider looks roughly like this:
import scrapy

class XYZSpider(scrapy.Spider):
    name = 'xyz'  # assumed spider name

    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b',
        ]
        self.driver = self.getSeleniumDriver()
        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()
        for url in urls:  # was "start_urls", which is not defined here
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...
So what happens now is that Scrapy does not wait for the spider to finish parsing. Scrapy gets a response, then calls the parse callback and fetches the next response in parallel. But the parse callback needs exclusive use of the Selenium driver before the next request is processed.

I want each request to wait until the parse callback has completed.
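One pattern that would produce the strictly sequential Request1 -> Crawl1 -> Request2 ordering described at the top is to chain requests: yield only the first request from start_requests, and yield the next one at the end of each callback, once all Selenium work for the current page is done. A minimal sketch; the pending-URL bookkeeping is my own, not from my code above:

import scrapy

class ChainedSpider(scrapy.Spider):
    '''Sketch: only one request is ever in flight, because the next request
    is yielded only after the current callback has finished.'''
    name = 'chained'  # assumed name

    def start_requests(self):
        # Queue of URLs still to crawl (my own bookkeeping).
        self.pending = ['https://www.example.com/a', 'https://www.example.com/b']
        self.driver = self.getSeleniumDriver()  # the same helper as above
        yield self.next_request()

    def next_request(self):
        url = self.pending.pop(0)
        request = scrapy.Request(url=url, callback=self.parse, dont_filter=True)
        request.meta['driver'] = self.driver
        request.meta['load_url'] = True
        request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
        request.meta['click_bet'] = None
        return request

    def parse(self, response):
        # ... do all Selenium-dependent work for this page first ...
        if self.pending:
            # Only now hand the scheduler the next request.
            yield self.next_request()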