How do I access command-line arguments in a Scrapy CrawlSpider?
I want to pass an argument on the scrapy crawl ... command line and use it in the rule definitions of my extended CrawlSpider, like this:
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's method parse_item
    Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
)
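I would like the allow attribute of SgmlLinkExtractor to be specified through a command-line argument. From what I have read, the argument values are available in the spider's __init__ method, but how do I get the command-line argument in time to use it in the rule definitions?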
1 Answer
You can build your spider's rules attribute inside the __init__ method, like this:
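# Import paths as used by the old Scrapy releases that ship SgmlLinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # Build the rules from the `allow` argument before calling the
        # parent __init__, which compiles self.rules.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=(allow,))),
        )
        super(MySpider, self).__init__(*args, **kwargs)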
Then you can pass the allow argument on the command line like this:
scrapy crawl example.com -a allow="item\.php"