I want to use a proxy middleware to add a proxy to my spider, but I don't understand why it filters the request as a duplicate.
Here is the code:
# spider.py
class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        yield Request(self.start_urls[0], dont_filter=True, callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url
        yield item
# middleware.py
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return request
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}
With dont_filter=True the spider falls into an infinite loop; with dont_filter=False the log is:
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)
How can I fix this?

Answer: If a downloader middleware's process_request only modifies the request and wants the framework to continue processing it, it should return None. Returning the request object, as your ProxyMiddleware does, tells Scrapy to re-schedule that request from the start: with dont_filter=True it loops forever, and with dont_filter=False the dupefilter drops the re-scheduled copy and the spider closes. So remove the return request at the end of your process_request.
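A minimal sketch of the corrected middleware (same class and proxy address as in the question; the class itself has no Scrapy dependency, since process_request only touches request.meta):

```python
# Corrected ProxyMiddleware: set the proxy on request.meta and return None,
# so the framework continues processing the request through the remaining
# downloader middlewares instead of re-scheduling it.
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return None  # returning None lets Scrapy keep processing this request
```

With this version the request reaches the downloader once, and neither the infinite loop nor the "Filtered duplicate request" message appears.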