
Posted 2024-04-18 23:07:54




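# spider

from scrapy import Request
from scrapy.spiders import CrawlSpider

# Assumes the standard Scrapy project layout for the TaylorSpider project
from TaylorSpider.items import TaylorspiderItem


class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        # dont_filter=True keeps the dupefilter from dropping the start request
        yield Request(self.start_urls[0], dont_filter=True, callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url

        yield item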

# middleware.py

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = ''
        return request        

# settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}




2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)


Answer 1 · posted 2024-04-18 23:07:54

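If a downloader middleware's process_request() only patches the request and wants the framework to continue processing it, it should return None. The Scrapy documentation says: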

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).


If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
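
The two cases quoted above can be illustrated with a toy model. This is only a sketch of the dispatch rule, not Scrapy's actual implementation: ToyRequest, PatchAndContinue, PatchAndReturnRequest and run_chain are hypothetical names invented for the illustration, and the Response and IgnoreRequest cases are left out.

```python
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request, just enough for the demo."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class PatchAndContinue:
    """Patches the request and returns None: the chain keeps going."""

    def process_request(self, request, spider):
        request.meta["patched"] = True
        return None


class PatchAndReturnRequest:
    """Patches the request and returns it: the chain stops here."""

    def process_request(self, request, spider):
        request.meta["patched"] = True
        return request


def run_chain(middlewares, request, spider=None):
    """Toy version of the rule: None continues, a Request is rescheduled."""
    for mw in middlewares:
        result = mw.process_request(request, spider)
        if isinstance(result, ToyRequest):
            # Remaining middlewares are skipped and the request goes back
            # to the scheduler instead of being downloaded.
            return "rescheduled"
    # Every middleware returned None, so the request reaches the downloader.
    return "downloaded"
```

In this model, a chain made only of PatchAndContinue middlewares ends in "downloaded", while any chain containing PatchAndReturnRequest ends in "rescheduled" before the download happens.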

So you want to remove the return request statement at the end of your process_request, so that the method returns None.
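
A minimal sketch of the corrected middleware, assuming the only goal is to set the proxy and let Scrapy carry on. The proxy address is a hypothetical placeholder (the original post leaves it empty), and FakeRequest is a stand-in used only to try the class without Scrapy installed.

```python
class ProxyMiddleware:
    """Downloader middleware that attaches a proxy to each outgoing request."""

    # Hypothetical placeholder: the original post does not show the real proxy.
    proxy_url = "http://127.0.0.1:8080"

    def process_request(self, request, spider):
        # Patch the request in place.
        request.meta["proxy"] = self.proxy_url
        # Return None (not the request) so Scrapy continues with the
        # remaining middlewares and eventually downloads the page.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request, used only to exercise the middleware."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


if __name__ == "__main__":
    req = FakeRequest("http://www.tandfonline.com/action/cookieAbsent")
    outcome = ProxyMiddleware().process_request(req, spider=None)
    print(outcome, req.meta)
```

Leaving out the return statement altogether has the same effect, since a Python function without one returns None.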
