Retrying requests in a Scrapy downloader middleware

Posted 2024-04-16 13:11:12


I use Scrapoxy, which rotates IPs while scraping.

I have a list of status codes, BLACKLIST_HTTP_STATUS_CODES, that indicate the current IP has been blocked.

Problem: as soon as a response whose status code is in BLACKLIST_HTTP_STATUS_CODES arrives, the scrapoxy downloader middleware raises IgnoreRequest and then changes the IP. As a result, my script skips the URLs whose responses had a bad status code.

Example log:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/190> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/191> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/192> (referer: None)
[spider] DEBUG: Ignoring Blacklisted response https://www.some-website.com/profile/193: HTTP status 429
[urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 13.33.33.37:8889
[urllib3.connectionpool] DEBUG: http://13.33.33.37:8889 "POST /api/instances/stop HTTP/1.1" 200 11
[spider] DEBUG: Remove: instance removed (1 instances remaining)
[spider] INFO: Sleeping 89 seconds
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/194> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/195> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/196> (referer: None)

So my script skipped https://www.some-website.com/profile/193.

Goal: I want to retry any request whose response status code is in BLACKLIST_HTTP_STATUS_CODES until the response status is no longer in that list.

My downloader middleware looks like this:

class BlacklistDownloaderMiddleware(object):
    def __init__(self, crawler):
        ...

    @classmethod
    def from_crawler(cls, crawler):
        ...

    def process_response(self, request, response, spider):
        """
        Detect blacklisted response and stop the instance if necessary.
        """
        try:
            # self._http_status_codes is actually BLACKLIST_HTTP_STATUS_CODES
            if response.status in self._http_status_codes:
                # BlacklistError is a custom exception I have defined
                raise BlacklistError(response, 'HTTP status {}'.format(response.status))
            return response

        # THIS IS HOW THE ORIGINAL CODE LOOKS
        except BlacklistError as ex:
            # Some logs
            spider.log('Ignoring Blacklisted response {0}: {1}'.format(response.url, ex.message), level=logging.DEBUG)
            # Get the name of the proxy instance that needs to be replaced
            name = response.headers['x-cache-proxyname'].decode('utf-8')
            # Change the proxy
            self._stop_and_sleep(spider, name)
            # Drop the URL
            raise IgnoreRequest()

            # MY TRY: I have tried this instead of raising IgnoreRequest, but
            # it does not work and complains that self.process_response is
            # missing the arguments spider and response
            # return Request(response.url, callback=self.process_response, dont_filter=True)




1 Answer

#1 · Posted 2024-04-16 13:11:12

Instead of returning a new request object, you should copy the original request with retry = request.copy(). You can look at how Scrapy's RetryMiddleware handles retries.

For reference:

def _retry(self, request):
    ...
    retryreq = request.copy()
    retryreq.dont_filter = True
    ...
    return retryreq

You can call it like this:

def process_response(self, request, response, spider):
    if response.status in self._http_status_codes:
        name = response.headers['x-cache-proxyname'].decode('utf-8')
        self._stop_and_sleep(spider, name)
        # Re-issue a copy of the original request instead of dropping it
        return self._retry(request)
    return response

This should give you the idea.
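To make the copy-and-retry idea concrete outside a running crawler, here is a minimal runnable sketch. The Request class below is only a stand-in for scrapy.Request so the snippet runs without Scrapy installed, and the MAX_BLACKLIST_RETRIES cap is an added assumption (not part of the question) so a permanently blocked URL cannot loop forever:

```python
import copy

class Request:
    """Tiny stand-in for scrapy.Request, just enough to show the retry logic."""
    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = meta or {}
        self.dont_filter = dont_filter

    def copy(self):
        # scrapy.Request.copy() likewise preserves url, meta, callback, etc.
        return Request(self.url, copy.deepcopy(self.meta), self.dont_filter)

BLACKLIST_HTTP_STATUS_CODES = {429, 503}
MAX_BLACKLIST_RETRIES = 5  # assumed cap, not in the original question

def retry_or_give_up(request, status):
    """Return a retried copy of request if status is blacklisted, else None."""
    if status not in BLACKLIST_HTTP_STATUS_CODES:
        return None
    retries = request.meta.get('blacklist_retries', 0) + 1
    if retries > MAX_BLACKLIST_RETRIES:
        return None  # give up; in the middleware you would raise IgnoreRequest here
    retryreq = request.copy()
    retryreq.meta['blacklist_retries'] = retries
    retryreq.dont_filter = True  # keep the dupefilter from dropping the seen URL
    return retryreq

req = Request('https://www.some-website.com/profile/193')
retried = retry_or_give_up(req, 429)
print(retried.dont_filter, retried.meta['blacklist_retries'])  # True 1
```

In the real middleware, returning the copied request from process_response hands it back to the scheduler, and dont_filter=True matters because the dupefilter has already seen that URL and would otherwise silently discard the retry.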
