Scrapy is not crawling all pages

3 votes

1 answer

5050 views

Asked 2025-04-17 17:30

This is the code I am using:

from scrapy.item import Item, Field

class Test2Item(Item):
    title = Field()

from scrapy.http import Request
from scrapy.conf import settings
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    start_urls = ['http://www.khmer24.com/']
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
    DOWNLOAD_DELAY = 2

    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        i['title'] = (hxs.select(('//div[@class="innerbox"]/h1/text()')).extract()[0]).strip(' \t\n\r')
        return i

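It only scrapes 10 to 15 records, and the number varies from run to run. I cannot get all the pages whose URLs follow the pattern http://www.khmer24.com/ad/any-words/67-anynumber.html

I strongly suspect that Scrapy stops crawling because of duplicate requests. They have suggested using dont_filter = True, but I don't know where to put it.

I'm new to Scrapy and would really appreciate some help.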

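Tags: data extraction, web scraping, scrapy, crawler techniques, request limits, beginner's guide, anti-scraping mechanisms, crawler configuration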

1 Answer

1. "They have suggested using dont_filter = True, but I don't know where to put it."

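dont_filter is an argument of the Request constructor, so it goes wherever you build a request yourself: Request(url, dont_filter=True). Its default value there is False. For start_urls you do not need to set it: BaseSpider, which CrawlSpider inherits from (scrapy/spider.py), already creates the start requests with dont_filter=True in make_requests_from_url.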
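To make the role of dont_filter concrete, here is a minimal plain-Python model of duplicate filtering. It is a simplified sketch, not Scrapy's actual scheduler: `FakeRequest` and `schedule` are hypothetical names invented for this illustration, and URLs stand in for the request fingerprints that Scrapy really compares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request (illustration only)."""
    url: str
    dont_filter: bool = False  # same default as the real Request argument


def schedule(requests):
    """Return the requests that survive duplicate filtering."""
    seen = set()
    accepted = []
    for request in requests:
        if not request.dont_filter:
            if request.url in seen:
                continue  # duplicate URL: dropped
            seen.add(request.url)
        # Requests with dont_filter=True skip the duplicate check entirely.
        accepted.append(request)
    return accepted


queue = [
    FakeRequest("http://www.khmer24.com/"),
    FakeRequest("http://www.khmer24.com/"),                    # filtered out
    FakeRequest("http://www.khmer24.com/", dont_filter=True),  # bypasses the filter
]
accepted = schedule(queue)
print(len(accepted))  # 2
```

In other words, dont_filter only matters when the same URL is requested more than once. It does not help a spider discover links it never extracted in the first place.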

2. "It only scrapes 10 or 15 records."

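Reason: the start_urls are poorly chosen. In this question the spider starts at http://www.khmer24.com/; suppose it finds 10 links there that match the pattern. It then crawls those 10 pages. Because those pages contain very few (or no) further links that match the pattern, the spider quickly runs out of links to follow and stops.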
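The effect described above can be reproduced with a toy model. The sketch below is plain Python, not Scrapy. The `SITE` link graph and all of its paths are made up for illustration; only the allow pattern comes from the question. It compares following only ad links with following every link.

```python
import re
from collections import deque

AD_PATTERN = re.compile(r"ad/.+/67-\d+\.html")

# Hypothetical link graph: page path -> links found on that page.
SITE = {
    "/": ["/ad/phone/67-1.html", "/ad/bike/67-2.html", "/all-ads.html"],
    "/ad/phone/67-1.html": ["/ad/bike/67-2.html"],
    "/ad/bike/67-2.html": [],
    "/all-ads.html": ["/ad/car/67-3.html", "/ad/house/67-4.html", "/all-ads-2.html"],
    "/all-ads-2.html": ["/ad/laptop/67-5.html"],
    "/ad/car/67-3.html": [],
    "/ad/house/67-4.html": [],
    "/ad/laptop/67-5.html": [],
}


def crawl(start, follow_everything):
    """Breadth-first crawl; return the set of ad pages that were reached."""
    seen = {start}
    queue = deque([start])
    ads = set()
    while queue:
        page = queue.popleft()
        if AD_PATTERN.search(page):
            ads.add(page)
        for link in SITE.get(page, []):
            if link in seen:
                continue
            # With a single ad-only rule, non-ad links are never followed.
            if follow_everything or AD_PATTERN.search(link):
                seen.add(link)
                queue.append(link)
    return ads


print(len(crawl("/", follow_everything=False)))  # 2
print(len(crawl("/", follow_everything=True)))   # 5
```

With the ad-only rule, the listing pages that link to most ads are never visited, so the crawl ends after the handful of ads reachable from the home page.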

Possible solution: the reason given above simply restates icecrime's point, and so does the solution.

Use the "all ads" page as start_urls. (Alternatively, keep the home page as start_urls and use the new rules.)

New rules:

rules = (
    # Extract links matching the "ad/any-words/67-anynumber.html" pattern
    # and parse them with the spider's parse_item method (do not follow them).
    # This rule comes first: when several rules match the same link,
    # CrawlSpider uses the first one defined.
    Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'),

    # Extract every remaining link and follow it
    # (no callback means follow=True by default;
    # if "allow" is not given, the extractor matches all links).
    Rule(SgmlLinkExtractor()),
)
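Rule order matters in a CrawlSpider: when several rules match the same link, the first one defined is used. The plain-Python sketch below models that behaviour. `pick_rule` and the rule tuples are hypothetical helpers written for this illustration and are not part of Scrapy's API.

```python
import re

AD_ALLOW = r"ad/.+/67-\d+\.html"

# Each rule is (name, allow pattern, callback); a pattern of None matches every link.
AD_RULE = ("ads", AD_ALLOW, "parse_item")
CATCH_ALL_RULE = ("everything", None, None)


def pick_rule(url, rules):
    """Return (name, callback) of the first rule whose pattern matches the URL."""
    for name, allow, callback in rules:
        if allow is None or re.search(allow, url):
            return name, callback
    return None, None


ad_url = "http://www.khmer24.com/ad/any-words/67-123.html"

# Specific rule first: the ad page is handed to parse_item.
print(pick_rule(ad_url, [AD_RULE, CATCH_ALL_RULE]))

# Catch-all rule first: it claims the ad link, so parse_item is never used.
print(pick_rule(ad_url, [CATCH_ALL_RULE, AD_RULE]))
```

Placing the specific rule before the catch-all one is what lets ad pages reach parse_item while every other page is still followed.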

References: SgmlLinkExtractor, CrawlSpider example

Answered 2025-04-17 by Python大师
