Recursive crawling up to a user-defined page

class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/alpha"]
    pattern = "/[\d]+$"
    rules = [
        Rule(LinkExtractor(allow=[pattern],
                           restrict_xpaths=('//*[@id = "imgholder"]/a', )),
             callback='parse_items', follow=True),
    ]

    def __init__(self, argument='', *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # some inputs and operations based on those inputs
        i = str(raw_input())  # another input
        # need to change the pattern here
        self.pattern = '/' + i + self.pattern
        # some other operations
        pass

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        img = hxs.select('//*[@id="imgholder"]/a')
        item = MyItem()
        item["field1"] = "something"
        item["field2"] = "something else"
        yield item
        pass

1 answer

User

#1 · Posted 2024-04-26 05:19:06

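By the time your __init__ method is called, the Rule has already been set up with the pattern defined at the top of the class, so assigning a new value to self.pattern afterwards does not affect it.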
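The timing issue can be reproduced without Scrapy. The following is a minimal sketch in plain Python; `FakeRule` and `DemoSpider` are hypothetical stand-ins invented for illustration, not Scrapy classes. It shows that a list built in the class body captures the pattern once, when the class is defined, and that a later assignment to `self.pattern` leaves it untouched.

```python
class FakeRule:
    """Hypothetical stand-in for a rule object; it only stores the allow list."""

    def __init__(self, allow):
        self.allow = allow


class DemoSpider:
    """Hypothetical stand-in for a spider class, used only to show timing."""

    pattern = r"/[\d]+$"
    # This list is evaluated once, while the class body runs.
    rules = [FakeRule(allow=[pattern])]

    def __init__(self, prefix):
        # Creates a new instance attribute; the rule above still holds
        # the string that was captured at class-definition time.
        self.pattern = "/" + prefix + self.pattern


spider = DemoSpider("alpha")
print(spider.pattern)            # /alpha/[\d]+$
print(spider.rules[0].allow[0])  # /[\d]+$
```

The second line of output is the unchanged class-level pattern, which is why the rule has to be rebuilt after the pattern changes.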

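However, you can change it dynamically inside the __init__ method. To do so, set the Rule again in the method body and recompile the rules (something like this):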

def __init__(self, argument='', *a, **kw):
    super(MySpider, self).__init__(*a, **kw)
    # set self.pattern here to what you need it to be
    # rebuild the rules so they use the updated pattern
    self.rules = [
        Rule(LinkExtractor(allow=[self.pattern],
                           restrict_xpaths=('//*[@id = "imgholder"]/a', )),
             callback='parse_items', follow=True),
    ]
    # now it is time to compile the new rules:
    self._compile_rules()
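Before wiring a user-supplied value into the rule, it may help to check what the combined expression matches. The sketch below uses only the standard `re` module and assumes the allow pattern is applied as a regular-expression search against the full URL. `build_pattern` is a hypothetical helper, and the `re.escape` call is an addition not present in the question; it makes the user input match literally.

```python
import re


def build_pattern(user_input, base=r"/[\d]+$"):
    """Hypothetical helper: prefix the base pattern with the user's path segment."""
    # re.escape is an assumption added here so that characters such as '.'
    # or '+' in the input are matched literally.
    return "/" + re.escape(user_input) + base


pattern = build_pattern("alpha")
regex = re.compile(pattern)

print(bool(regex.search("http://www.example.com/alpha/12")))    # True
print(bool(regex.search("http://www.example.com/beta/12")))     # False
print(bool(regex.search("http://www.example.com/alpha/12/x")))  # False
```

The resulting string can then be assigned to `self.pattern` before the rules are rebuilt.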
