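<p><strong>Proxies</strong>
I would go with a provider that offers a proxy rotator so you don't have to manage rotation yourself, but you can also write a custom middleware, and I'll show you how. What you need to edit is the <code>process_request</code> method. There you can change both the proxy and the user agent.</p>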
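<p>If you would rather cycle through proxies in a fixed order than pick one at random, a minimal sketch could look like the following. <code>ProxyRotator</code> and the placeholder addresses are hypothetical names used for illustration only; they are not part of Scrapy.</p>
<pre><code>import itertools


class ProxyRotator:
    """Hand out proxies in round-robin order (hypothetical helper, not a Scrapy API)."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        # itertools.cycle repeats the list endlessly in the same order.
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy in the rotation."""
        return next(self._cycle)


# Placeholder addresses; replace them with real proxy URLs.
rotator = ProxyRotator(["http://ip1:port1", "http://ip2:port2"])
</code></pre>
<p>Inside a middleware, you would then assign <code>rotator.next_proxy()</code> to <code>request.meta['proxy']</code> in <code>process_request</code> instead of calling <code>random.choice</code>.</p>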
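<p><strong>User agents</strong>
You can use the Scrapy random user-agent middleware <a href="https://github.com/cleocn/scrapy-random-useragent" rel="nofollow noreferrer">https://github.com/cleocn/scrapy-random-useragent</a>, or write your own. This is how you can use a middleware to change anything on the request object, including the proxy or any other header.</p>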
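<pre><code># middlewares.py
import random

from scrapy import signals

# Placeholder values; replace them with real user-agent strings and proxy URLs.
# Scrapy's docs show proxy values that include the scheme, e.g. http://host:port.
user_agents = ['agent1', 'agent2', 'agent3', 'agent4']
proxies = ['http://ip1:port1', 'http://ip2:port2', 'http://ip3:port3', 'http://ip4:port4']
# Keep your user agents in a file or elsewhere; this assumes you can load them into a list.
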
class MyMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        request.headers['User-Agent'] = random.choice(user_agents)  # !! These 2 lines
        request.meta['proxy'] = random.choice(proxies)  # !! These 2 lines
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyMiddleware': 543,
}
</code></pre>
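<p>The comment in the middleware mentions keeping the user agents in a file. One way to get them into a list is sketched below, assuming a plain text file with one entry per line. The <code>load_lines</code> helper and the <code>useragents.txt</code> file name are hypothetical.</p>
<pre><code>from pathlib import Path


def load_lines(path):
    """Return the non-empty, non-comment lines of a text file as a list."""
    lines = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines and lines starting with '#'.
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


# Example usage in middlewares.py (file name is an assumption):
# user_agents = load_lines("useragents.txt")
</code></pre>
<p>The same helper would work for a proxy list, as long as each line holds one proxy URL.</p>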
<p>Reference:
<a href="https://docs.scrapy.org/en/latest/topics/request-response.html" rel="nofollow noreferrer">https://docs.scrapy.org/en/latest/topics/request-response.html</a></p>