Scrapy seems to deduplicate the first request when it is processed by a DownloaderMiddleware

def process_request(self, request: scrapy.http.Request, spider):
    if "Host" in request.headers:
        return None

    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    request.headers["Host"] = host
    spider.logger.info(f"Got {request}")
    return request

2021-10-16 21:21:08 [ficbook-spider] INFO: Got <GET https://mywebsite.com/sitemap.xml>
2021-10-16 21:21:08 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mywebsite.com/sitemap.xml>

1 answer

Anonymous user

Answer #1 · posted on 2024-04-25 23:51:04

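Scrapy never handles a response for the first request and never downloads a second one, because your custom DownloaderMiddleware returns a request from its process_request method. That returned request goes back to the scheduler, where the duplicate filter sees a URL it has already scheduled and drops it. From the documentation: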

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
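To make the filtering step concrete, here is a toy model of a scheduler with a duplicate filter. It is only a sketch and not Scrapy's actual code: `toy_fingerprint` and `ToyScheduler` are hypothetical names, and it assumes the default filter identifies a request by method, URL, and body while ignoring headers, so adding a `Host` header does not make the rescheduled request look new.

```python
import hashlib


def toy_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified stand-in for a request fingerprint; headers are NOT hashed."""
    digest = hashlib.sha1()
    digest.update(method.encode())
    digest.update(url.encode())
    digest.update(body)
    return digest.hexdigest()


class ToyScheduler:
    """Hypothetical model of a scheduler that drops already-seen requests."""

    def __init__(self):
        self.seen = set()

    def enqueue(self, method, url, headers=None, dont_filter=False) -> bool:
        """Return True if the request is accepted, False if it is filtered."""
        if dont_filter:
            # The duplicate check is skipped entirely for these requests.
            return True
        fp = toy_fingerprint(method, url)
        if fp in self.seen:
            return False  # same method and URL already scheduled: dropped
        self.seen.add(fp)
        return True


scheduler = ToyScheduler()
url = "https://mywebsite.com/sitemap.xml"
first = scheduler.enqueue("GET", url)
second = scheduler.enqueue("GET", url, headers={"Host": "mywebsite.com"})
third = scheduler.enqueue("GET", url, headers={"Host": "mywebsite.com"}, dont_filter=True)
print(first, second, third)  # True False True
```

The second call mirrors what happens in the question: the request comes back with a new header but the same method and URL, so it is treated as a duplicate.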

If you explicitly tell Scrapy not to filter the rescheduled request (dont_filter=True), it should work:

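def process_request(self, request: scrapy.http.Request, spider):
    # Second pass: the header is already set, so let the request continue.
    if "Host" in request.headers:
        return None

    # str.removeprefix requires Python 3.9 or newer.
    host = request.url.removeprefix("https://").removeprefix("http://").split("/")[0]
    # Copy the request and mark it so the duplicate filter skips it.
    new_req = request.replace(dont_filter=True)
    new_req.headers["Host"] = host
    spider.logger.info(f"Got {new_req}")
    return new_req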
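A possible alternative, offered as a sketch and not as the approach from the answer above: since only a header changes, the middleware could modify the request in place and return None, so the request is never rescheduled and never reaches the duplicate filter a second time. The class name `HostHeaderMiddleware` is hypothetical, and using `urllib.parse.urlsplit` instead of the string slicing is my own substitution.

```python
from urllib.parse import urlsplit


class HostHeaderMiddleware:
    """Hypothetical downloader middleware that sets Host without rescheduling."""

    def process_request(self, request, spider):
        # Only fill in the header when it is missing.
        if "Host" not in request.headers:
            # netloc is the "host[:port]" part of the URL
            # (it also keeps any "user:password@" prefix, if present).
            request.headers["Host"] = urlsplit(request.url).netloc
        # Returning None tells Scrapy to keep processing this same request.
        return None
```

Like any downloader middleware, this would still need to be enabled through the DOWNLOADER_MIDDLEWARES setting of the project.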
