Why does this clumsy proxy middleware issue duplicate requests?


I want to use a proxy middleware to add a proxy to my spider, but I don't understand why the request gets filtered as a duplicate.

Here is the code:

from scrapy import Request
from scrapy.spiders import CrawlSpider

from TaylorSpider.items import TaylorspiderItem  # the project's item class


class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        yield Request(self.start_urls[0], dont_filter=True,
                      callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url
        yield item

# middlewares.py

import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return request


# settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}      

When dont_filter=True, the spider falls into an infinite loop; when dont_filter=False, the log is:

2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)

So how do I fix this?


1 Answer

If a downloader middleware's process_request() only patches the request and wants the framework to continue processing it, it should return None. From the Scrapy docs:

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).

(...)

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

So you want to drop the return request at the end of your process_request(); falling off the end of the method returns None implicitly.
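
Below is a minimal sketch of the corrected middleware, assuming the same hard-coded proxy from the question. The only change is that process_request() no longer returns the request, so it implicitly returns None and the download proceeds through the rest of the chain:

# middlewares.py -- corrected: fall through to an implicit "return None"

import logging

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        logger.info('pr........................')
        # Patch the request in place; HttpProxyMiddleware (priority 110)
        # reads request.meta['proxy'] later in the chain.
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        # No return statement: returning None tells Scrapy to keep
        # processing this request instead of rescheduling it.

With this change the request is never re-enqueued, so the dupefilter has nothing to filter, and dont_filter can be left at its default of False.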
