Why does this clumsy proxy middleware issue duplicate requests?


I want to use a proxy middleware to add a proxy to my spider, but I don't understand why the request gets filtered as a duplicate.

Here is the code:

from scrapy import Request
from scrapy.spiders import CrawlSpider

from TaylorSpider.items import TaylorspiderItem  # the project's item class


class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        yield Request(self.start_urls[0], dont_filter=True,
                      callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url
        yield item

# middlewares.py

import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return request


# settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}      

When dont_filter=True, the spider falls into an infinite loop; when dont_filter=False, the log is:

2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)

So how do I fix this?


1 Answer

If a downloader middleware's process_request() only patches the request and wants the framework to continue processing it, it should return None. From the Scrapy docs:

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).

(...)

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

So you want to drop the return request at the end of your process_request(); falling off the end of the method returns None implicitly.
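
Below is a minimal sketch of the corrected middleware, assuming the same hard-coded proxy from the question. The only change is that process_request() no longer returns the request, so it implicitly returns None and the download proceeds through the rest of the chain:

# middlewares.py -- corrected: fall through to an implicit "return None"

import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        logger.info('pr........................')
        # Patch the request in place; HttpProxyMiddleware (priority 110)
        # reads request.meta['proxy'] later in the chain.
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        # No return statement: returning None tells Scrapy to keep
        # processing this request instead of rescheduling it.

With this change the request is never re-enqueued, so the dupefilter has nothing to filter, and dont_filter can be left at its default of False.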
