Scrapy not crawling all pages

4 votes

3 answers

7928 views

Asked 2025-04-17 07:42

I am trying to crawl a site in a very basic way, but Scrapy is not following all of the links. Here is the setup:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains a link to b_page.html
a2_page.html -> contains a link to c_page.html
b1_page.html -> contains a link to a_page.html
b2_page.html -> contains a link to c_page.html
c1_page.html -> contains a link to a_page.html
c2_page.html -> contains a link to main_page.html

I am using the following rule in my CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)

But the crawl output looks like this:

DEBUG: Crawled (200) <GET http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It does not crawl all of the pages.

Note: I set the crawl to run in BFO (breadth-first order), as described in the Scrapy documentation.

What am I missing?


3 Answers

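Many of the URLs are probably duplicates. Scrapy avoids processing duplicate URLs because doing so would be inefficient. From your description, since you use a rule that follows links, there are bound to be a lot of duplicates.

If you want to confirm this and see it in the log, add the following to your settings.py: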

DUPEFILTER_DEBUG = True

You will then see log entries like this:

2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.example.org/example.html>
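To make the effect of duplicate filtering concrete, here is a minimal sketch in plain Python that walks the link structure from the question breadth-first and fetches each URL at most once. It is only a simplified model, not Scrapy's implementation: the names LINKS and crawl are made up for this illustration, and the real filter works on request fingerprints rather than bare page names.

```python
from collections import deque

# Link structure copied from the question (page -> pages it links to).
LINKS = {
    "main_page.html": ["a_page.html", "b_page.html", "c_page.html"],
    "a_page.html": ["a1_page.html", "a2_page.html"],
    "b_page.html": ["b1_page.html", "b2_page.html"],
    "c_page.html": ["c1_page.html", "c2_page.html"],
    "a1_page.html": ["b_page.html"],
    "a2_page.html": ["c_page.html"],
    "b1_page.html": ["a_page.html"],
    "b2_page.html": ["c_page.html"],
    "c1_page.html": ["a_page.html"],
    "c2_page.html": ["main_page.html"],
}


def crawl(start):
    """Breadth-first walk that fetches each URL at most once (hypothetical helper)."""
    seen = {start}           # URLs already scheduled
    queue = deque([start])   # FIFO queue gives breadth-first order
    fetched = []             # order in which pages are "downloaded"
    filtered = []            # (source, target) links dropped as duplicates
    while queue:
        page = queue.popleft()
        fetched.append(page)
        for target in LINKS[page]:
            if target in seen:
                filtered.append((page, target))
            else:
                seen.add(target)
                queue.append(target)
    return fetched, filtered


fetched, filtered = crawl("main_page.html")
print(len(fetched), "pages fetched;", len(filtered), "duplicate links filtered")
# prints: 10 pages fetched; 6 duplicate links filtered
```

In this model every page is still fetched exactly once; the filter only drops the six links that point back to pages already seen.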

Answered 2025-04-17 by Python大师


Scrapy filters out all duplicate requests by default.

If you want to bypass this, you can do something like the following (example):

yield Request(url="http://test.com", callback=self.callback, dont_filter=True)

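dont_filter (boolean): indicates that this request should not be filtered by the scheduler. Use it when you want to send an identical request more than once and need the duplicate filter to ignore it. Use it with care, or the spider may get stuck in a crawling loop. The default is False.

For more information, see the Request object documentation.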
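The behaviour described above can be pictured with a toy scheduler. This is a rough sketch for illustration only: TinyScheduler and its enqueue method are hypothetical names, not part of Scrapy, and Scrapy's own duplicate filter compares request fingerprints rather than raw URL strings.

```python
class TinyScheduler:
    """Toy stand-in for a crawler scheduler with a duplicate filter (not Scrapy's)."""

    def __init__(self):
        self.seen = set()   # URLs accepted so far
        self.queue = []     # requests waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Queue url; return False if it was dropped as a duplicate."""
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = TinyScheduler()
first = scheduler.enqueue("http://localhost/a_page.html")
second = scheduler.enqueue("http://localhost/a_page.html")
forced = scheduler.enqueue("http://localhost/a_page.html", dont_filter=True)
print(first, second, forced, len(scheduler.queue))
# prints: True False True 2
```

The second request is dropped, while the one marked dont_filter=True is queued again. On a site whose pages link back to each other, forcing every request through this way never terminates, which is why the warning about loops matters.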

Answered 2025-04-17 by Python大师


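I ran into a similar problem today, although I was using a custom spider. It turned out that the site was throttling my crawl because my user agent was scrappy-bot.

Try changing your user agent to that of a common browser.

Another thing you can try is adding a delay. Some sites block crawling if the interval between requests is too short. Try setting DOWNLOAD_DELAY to 2 seconds and see whether that helps.

More information about DOWNLOAD_DELAY is available at http://doc.scrapy.org/en/0.14/topics/settings.html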
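As a rough sketch, the two suggestions above could look like this in settings.py. USER_AGENT and DOWNLOAD_DELAY are standard Scrapy settings; the user agent string itself is only a placeholder assumed for this example, so substitute whatever your own browser sends.

```python
# settings.py (fragment)

# Identify as a browser instead of the default Scrapy user agent.
# Placeholder value for illustration; replace it with a real browser's string.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Wait 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2
```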

Answered 2025-04-17 by Python大师

