Does Scrapy automatically retry when it hits an exception?

Asked 2025-04-18 04:05

I just finished a Scrapy project and found the following in the log:

INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 197,
     'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 7,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 190,
     'downloader/request_bytes': 2765511,
     'downloader/request_count': 8616,
     'downloader/request_method_count/GET': 8616,
     'downloader/response_bytes': 107541395,
     'downloader/response_count': 8419,
     'downloader/response_status_count/200': 8052,
     'downloader/response_status_count/301': 144,
     'downloader/response_status_count/302': 223,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 4, 24, 13, 35, 38, 955000),
     'item_scraped_count': 7861,
     'log_count/ERROR': 4,
     'log_count/INFO': 7918,
     'request_depth_max': 20,
     'response_received_count': 8052,
     'scheduler/dequeued': 8616,
     'scheduler/dequeued/memory': 8616,
     'scheduler/enqueued': 8616,
     'scheduler/enqueued/memory': 8616,
     'spider_exceptions/TypeError': 4,
     'start_time': datetime.datetime(2014, 4, 24, 12, 45, 5, 812000)}

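I would like to know whether Scrapy retries when it hits errors such as ResponseFailed and ResponseNeverReceived, because the results are not what I expected. In theory there should be nearly 30,000 records to scrape, but it only got 8616. This was my second run of the project; the first run got only 7000. Querying the database shows 9035 unique records in total, which means the two runs overlapped, but the second run also picked up some records that the first one missed. Why does this happen?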
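As a quick sanity check, the figures in the stats dump above can be reconciled with a few lines of Python. This is only a sketch: the numbers are copied from the log, and the helper name summarize is made up for illustration. It shows that 8616 is the request count rather than the item count (item_scraped_count is 7861), and that the gap between requests and responses equals the exception count.

```python
# Figures copied verbatim from the "Dumping Scrapy stats" output above.
stats = {
    "downloader/exception_count": 197,
    "downloader/request_count": 8616,
    "downloader/response_count": 8419,
    "downloader/response_status_count/200": 8052,
    "downloader/response_status_count/301": 144,
    "downloader/response_status_count/302": 223,
    "item_scraped_count": 7861,
}


def summarize(stats):
    """Derive a few consistency figures from a Scrapy stats dict.

    `summarize` is a hypothetical helper for this post, not part of Scrapy.
    """
    requests = stats["downloader/request_count"]
    responses = stats["downloader/response_count"]
    prefix = "downloader/response_status_count/"
    # Sum of all per-status counters (200, 301, 302 in this log).
    status_total = sum(v for k, v in stats.items() if k.startswith(prefix))
    return {
        # Requests that never produced a response.
        "requests_without_response": requests - responses,
        "status_total": status_total,
        # Share of requests that ended in a downloader exception, in percent.
        "exception_percent": 100 * stats["downloader/exception_count"] / requests,
        "items": stats["item_scraped_count"],
    }


summary = summarize(stats)
for name, value in summary.items():
    print(name, value)
```

If the numbers line up as they do here, the downloader exceptions account for only a small share of the requests, so they cannot by themselves explain the distance between about 8,000 items and the expected 30,000.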

exception handling, data scraping, data integrity, scrapy, automatic retry, crawler logs

1 Answer

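I ran into a similar problem and only solved it a few hours ago. The issue is that Scrapy filters out duplicate requests by default. To change this behavior, set the dont_filter parameter of the Request to True.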

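The filter exists to keep the crawler from getting stuck in loops, but if you are doing a horizontal crawl (many URLs, each visited only once, like the roughly 30,000 URLs mentioned), turning it off causes no problems.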
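To make the filtering behavior concrete, here is a small standard-library model of it. This is a simplified sketch, not Scrapy's implementation: ToyRequest and ToyScheduler are invented names, and the model keys on the URL alone, whereas Scrapy fingerprints the whole request. In a real spider the equivalent switch is the dont_filter argument of Request described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for a crawler request (not scrapy.Request)."""
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Simplified model of a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()   # URLs already accepted through the filter
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, request):
        """Return True if the request was queued, False if it was dropped."""
        if not request.dont_filter:
            if request.url in self.seen:
                return False  # duplicate: silently dropped
            self.seen.add(request.url)
        # Requests with dont_filter=True bypass the filter entirely.
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
url = "http://example.com/page"  # placeholder URL for illustration
accepted = [
    scheduler.enqueue(ToyRequest(url)),                    # first time: queued
    scheduler.enqueue(ToyRequest(url)),                    # repeat: dropped
    scheduler.enqueue(ToyRequest(url, dont_filter=True)),  # repeat, filter bypassed
]
print(accepted, len(scheduler.queue))
```

The trade-off is the one mentioned above: bypassing the filter lets a repeated URL be fetched again, but it also removes the protection against crawling the same pages in a loop.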

You can find more information about dont_filter here:

http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=dont_filter#request-objects

Answered 2025-04-18 by Python大师
