Understanding Scrapy's CrawlSpider Rules

Question

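I'm having trouble with a spider of my own that inherits from CrawlSpider, specifically with how the rules field is used. The spider is meant to scrape listings for pizza places in San Francisco from the Yellow Pages.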

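I tried to keep the rules simple, just to see whether the spider would crawl through the links in the response, but that is not what happens. All I see is a request for the next page, followed by a request for the page after that.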

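I have two questions:

1. When a response arrives, does the spider process the rules before calling the callback, or the other way around?
2. When are the rules applied?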

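EDIT: I figured it out. I had overridden the parse method of CrawlSpider. After looking at the parse method in that class, I realized that this is where it checks the rules and crawls through the matching links.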

NOTE: Know what you're overriding

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from yellowPages.items import YellowpagesItem
from scrapy.http import Request

class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    businesses = []

    # start with one page
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (Rule (SgmlLinkExtractor()
    , callback="parse_items", follow= True),
    )

    base_url = 'http://www.yellowpages.com'

    def parse(self, response):
        yield Request(response.url, callback=self.parse_business_listings_page)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        return []

    def parse_business_listings_page(self, response):
        print "Visiting %s" % response.url

        self.businesses.append(self.extract_businesses_from_response(response))
        hxs = Selector(response)
        li_tags = hxs.xpath('//*[@id="main-content"]/div[4]/div[5]/ul/li')
        next_exist = False

        # Check to see if there's a "Next". If there is, store the links.
        # If not, return. 
        # This requires a linear search through the list of li_tags. Is there a faster way?
        for li in li_tags:
            li_text = li.xpath('.//a/text()').extract()
            li_data_page = li.xpath('.//a/@data-page').extract()
            # Note: sometimes li_text is an empty list so check to see if it is nonempty first
            if (li_text and li_text[0] == 'Next'):
                next_exist = True
                next_page_num = li_data_page[0]
                url = 'http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza&page='+next_page_num
                yield Request(url, callback=self.parse_business_listings_page)
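
To illustrate the point of the edit above, here is a minimal sketch in plain Python, not Scrapy's actual source. It models a CrawlSpider-style class whose parse method is the place where the rules get applied, roughly as older Scrapy releases (the ones with the scrapy.contrib imports used here) behaved; newer releases may differ in detail. All names (MiniRule, MiniCrawlSpider, all_links, the example.com URLs) are hypothetical, and responses are modeled as plain dicts. Under those assumptions it shows two things: the callback's output comes first and link extraction follows, and a subclass that overrides parse never reaches the rules at all.

```python
# Toy model with hypothetical names. It is NOT Scrapy's code; it only mirrors
# the behaviour described in the edit: the rules are applied inside parse().

class MiniRule:
    def __init__(self, extract_links, callback=None, follow=True):
        self.extract_links = extract_links  # function: response -> list of URLs
        self.callback = callback            # name of a spider method, or None
        self.follow = follow


class MiniCrawlSpider:
    rules = ()

    def parse(self, response):
        # Default entry point: run the start-URL hook, then apply the rules.
        return self._parse_response(response, self.parse_start_url, follow=True)

    def parse_start_url(self, response):
        # Hook meant to be overridden instead of parse().
        return []

    def _parse_response(self, response, callback, follow):
        # Step 1: whatever the callback produces is passed through first.
        if callback is not None:
            for result in callback(response) or ():
                yield result
        # Step 2: only afterwards are the rules used to extract links.
        if follow:
            for rule in self.rules:
                for url in rule.extract_links(response):
                    yield ("request", url, rule.callback)


def all_links(response):
    # Stand-in for a link extractor: the fake response already lists its links.
    return response["links"]


class RulesIntact(MiniCrawlSpider):
    rules = (MiniRule(all_links, callback="parse_items"),)

    def parse_start_url(self, response):
        yield ("item", response["url"])


class ParseOverridden(MiniCrawlSpider):
    rules = (MiniRule(all_links, callback="parse_items"),)

    def parse(self, response):
        # Same mistake as in the question: this replaces the method that
        # would have applied the rules, so they are never consulted.
        yield ("request", response["url"], "parse_business_listings_page")


page = {
    "url": "http://example.com/start",
    "links": ["http://example.com/a", "http://example.com/b"],
}

intact = list(RulesIntact().parse(page))
overridden = list(ParseOverridden().parse(page))
```

In this model, intact holds the item from parse_start_url followed by one request per extracted link, while overridden holds only the single request produced by the custom parse. For a real spider this suggests leaving parse alone and putting start-page handling in parse_start_url, which CrawlSpider provides for that purpose.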



1 Answer

Note!
