Badly behaved SgmlLinkExtractor rules and callbacks are causing headaches

Published 2024-05-14 10:06:38


What I'm trying to do:

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2']
    rules = (Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True), Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly))

def parse_loly(self, response):
    print 'Hi this is the loly page %s' % response.url
    return

This throws me:

^{pr2}$

If I change the callback to callback="self.parse_loly", it never seems to get called, and the URL is never printed.

The site seems to crawl fine, though, since I get lots of Crawled 200 debug messages for the rules.

What am I doing wrong?

Thanks in advance!


Tags: self, com, true, parse, response, domain, callback, rule
1 Answer

The indentation of parse_loly doesn't look right. Python is whitespace-sensitive, so to the interpreter the method appears to be defined outside SpiderSpider.
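This is also why `callback=self.parse_loly` blows up: the class body runs before any instance exists, so `self` is undefined there, which is why Scrapy accepts the callback as a string and resolves it to a method later. A minimal, Scrapy-free sketch (the class names here are made up for illustration):

```python
# 'self' does not exist in a class body, so callback=self.parse_loly
# raises NameError while the class is still being defined.
class Broken(object):
    try:
        cb = self.parse_loly  # NameError: name 'self' is not defined
    except NameError as e:
        cb = str(e)

# The fix: refer to the method by its string name and look it up later,
# which is what Scrapy does with callback='parse_loly'.
class Works(object):
    def parse_loly(self):
        return 'called'
    cb = 'parse_loly'

print(Broken.cb)                     # name 'self' is not defined
print(getattr(Works(), Works.cb)())  # called
```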

You may also want to split the rules assignment across shorter lines, as PEP 8 recommends.

Try this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2/']
    rules = (
        # Category and pagination pages: no callback, so follow defaults to True.
        Rule(SgmlLinkExtractor(allow=(r'\w+$', ))),
        Rule(SgmlLinkExtractor(allow=(r'\w+/\d+$', ))),
        # Item pages: note the callback is the method *name* as a string.
        Rule(SgmlLinkExtractor(allow=(r'\d+$', )), callback='parse_loly'),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return None
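Incidentally, the question's original allow patterns overlap in a way worth knowing about: `\w` also matches digits, so a purely numeric item URL satisfies the first rule's pattern too, and CrawlSpider hands each extracted link only to the first rule that picks it up. A quick check with the standard re module shows the overlap (the allow patterns are applied as a search over the URL; the example URLs below are made up):

```python
import re

# The three 'allow' patterns from the question's rules.
patterns = {
    'category':  r'directory/lol2/\w+$',
    'paginated': r'directory/lol2/\w+/\d+$',
    'item':      r'directory/lol2/\d+$',
}

# Hypothetical URLs of each kind.
urls = [
    'http://www.domain.com/directory/lol2/cats',
    'http://www.domain.com/directory/lol2/cats/7',
    'http://www.domain.com/directory/lol2/12345',
]

for url in urls:
    matched = [name for name, pat in patterns.items() if re.search(pat, url)]
    print('%s -> %s' % (url, matched))

# The numeric URL matches both 'category' and 'item', because \w includes \d,
# so rule order can decide which rule (and which callback) gets the link.
```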
