Badly behaved SgmlLinkExtractor rules and callbacks are causing headaches

Published 2024-05-14 10:06:38


What I'm trying to do:

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2']
    rules = (Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True), Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly))

def parse_loly(self, response):
    print 'Hi this is the loly page %s' % response.url
    return

This throws me:

^{pr2}$

If I change the callback to callback="self.parse_loly", it never seems to get called, and the URL is never printed.

The site seems to crawl fine, though, since I get lots of Crawled 200 debug messages for the rules.

What am I doing wrong?

Thanks in advance!


Tags: self, com, true, parse, response, domain, callback, rule
1 Answer

The indentation of parse_loly doesn't look right. Python is whitespace-sensitive, so to the interpreter the method appears to be defined outside SpiderSpider.
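This is also why `callback=self.parse_loly` blows up: the class body runs before any instance exists, so `self` is undefined there, which is why Scrapy accepts the callback as a string and resolves it to a method later. A minimal, Scrapy-free sketch (the class names here are made up for illustration):

```python
# 'self' does not exist in a class body, so callback=self.parse_loly
# raises NameError while the class is still being defined.
class Broken(object):
    try:
        cb = self.parse_loly  # NameError: name 'self' is not defined
    except NameError as e:
        cb = str(e)

# The fix: refer to the method by its string name and look it up later,
# which is what Scrapy does with callback='parse_loly'.
class Works(object):
    def parse_loly(self):
        return 'called'
    cb = 'parse_loly'

print(Broken.cb)                     # name 'self' is not defined
print(getattr(Works(), Works.cb)())  # called
```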

You may also want to split the rules assignment across shorter lines, as PEP 8 recommends.

Try this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2/']
    rules = (
        # Category and pagination pages: no callback, so follow defaults to True.
        Rule(SgmlLinkExtractor(allow=(r'\w+$', ))),
        Rule(SgmlLinkExtractor(allow=(r'\w+/\d+$', ))),
        # Item pages: note the callback is the method *name* as a string.
        Rule(SgmlLinkExtractor(allow=(r'\d+$', )), callback='parse_loly'),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return None
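Incidentally, the question's original allow patterns overlap in a way worth knowing about: `\w` also matches digits, so a purely numeric item URL satisfies the first rule's pattern too, and CrawlSpider hands each extracted link only to the first rule that picks it up. A quick check with the standard re module shows the overlap (the allow patterns are applied as a search over the URL; the example URLs below are made up):

```python
import re

# The three 'allow' patterns from the question's rules.
patterns = {
    'category':  r'directory/lol2/\w+$',
    'paginated': r'directory/lol2/\w+/\d+$',
    'item':      r'directory/lol2/\d+$',
}

# Hypothetical URLs of each kind.
urls = [
    'http://www.domain.com/directory/lol2/cats',
    'http://www.domain.com/directory/lol2/cats/7',
    'http://www.domain.com/directory/lol2/12345',
]

for url in urls:
    matched = [name for name, pat in patterns.items() if re.search(pat, url)]
    print('%s -> %s' % (url, matched))

# The numeric URL matches both 'category' and 'item', because \w includes \d,
# so rule order can decide which rule (and which callback) gets the link.
```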
