<p>The linked answer that @vezunchik already pointed out almost gets you there. The only problem is that, using that exact code, you end up spawning multiple chromedriver instances. To reuse one and the same driver across requests, you can try it like below.</p>
<p>Create a file <code>middleware.py</code> within your scrapy project and paste in the following code:</p>
<pre><code>from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        self.driver = webdriver.Chrome(options=chromeOptions)
    @classmethod
    def from_crawler(cls, crawler):
        # quit the shared driver when the spider closes, so stray
        # chromedriver processes do not pile up between runs
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware
    def spider_closed(self, spider):
        self.driver.quit()
    def process_request(self, request, spider):
        # fetch the page with Selenium and hand the rendered HTML back
        # to Scrapy; returning a response here bypasses the downloader
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
</code></pre>
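<p>As an aside, if you later run the crawl with the <code>scrapy crawl</code> command instead of a standalone script, the equivalent way to switch this middleware on is from the project's <code>settings.py</code>. The dotted path below is an assumption, so match it to wherever your <code>middleware.py</code> actually lives:</p>

```python
# settings.py -- the module path 'yourproject.middleware' is assumed;
# adjust it to your own project layout
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middleware.SeleniumMiddleware': 200,
}
```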
<p>Came up with an update, in case you want to watch how chromedriver traverses the pages in visible mode. To let the browser do its browsing in plain sight, just leave out the <code>chromeOptions.add_argument("--headless")</code> line from the middleware above.</p>
<p>Use the following script to grab whatever content you need. There will be only one request (navigation) per url through the middleware using selenium. Now you can make use of <code>Selector()</code> within your spider to fetch the data, like below.</p>
<pre><code>import sys
# The hardcoded address leads to your project location, which ensures that
# you can add the middleware reference within CrawlerProcess
sys.path.append(r'C:\Users\WCS\Desktop\yourproject')
import scrapy
from scrapy import Selector
from scrapy.crawler import CrawlerProcess

class YPageSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ['https://www.yellowpages.com/search?search_terms=Pizza+Hut&geo_location_terms=San+Francisco%2C+CA']

    def parse(self, response):
        items = Selector(response)
        for elem in items.css(".v-card .info a.business-name::attr(href)").getall():
            yield scrapy.Request(url=response.urljoin(elem), callback=self.parse_info)

    def parse_info(self, response):
        items = Selector(response)
        title = items.css(".sales-info > h1::text").get()
        yield {"title": title}

if __name__ == '__main__':
    c = CrawlerProcess({
        'DOWNLOADER_MIDDLEWARES': {'yourspider.middleware.SeleniumMiddleware': 200},
    })
    c.crawl(YPageSpider)
    c.start()
</code></pre>
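<p>A side note on the <code>response.urljoin(elem)</code> call in <code>parse</code>: the <code>href</code> values scraped from the listing page are relative paths, and Scrapy's <code>Response.urljoin</code> combines them with the page URL the same way the standard library's <code>urllib.parse.urljoin</code> does. A tiny sketch, using a made-up listing path:</p>

```python
from urllib.parse import urljoin

# the search page acts as the base url; the relative href (made up
# for illustration) replaces its path and query string
base = 'https://www.yellowpages.com/search?search_terms=Pizza+Hut'
full = urljoin(base, '/san-francisco-ca/mip/pizza-hut-example')
print(full)  # https://www.yellowpages.com/san-francisco-ca/mip/pizza-hut-example
```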