How do I access command-line arguments in a Scrapy CrawlSpider?

4 votes

1 answer

913 views

Asked 2025-04-18 04:41

I want to pass an argument on the scrapy crawl ... command line and use it in the rule definitions of my CrawlSpider subclass, like the following:

name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's method parse_item
    Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
)

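I want the allow attribute of SgmlLinkExtractor to be specified by a command-line argument. From what I have read, the argument value is available in the spider's __init__ method, but the rules are defined at class level. How do I get the command-line value into the rule definitions?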

Tags: command-line arguments, parameter passing, web scraping, scrapy, crawlspider, rule definition, SgmlLinkExtractor, spider methods

1 Answer

You can build the spider's rules attribute inside its __init__ method, like this:

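# Import paths for the older Scrapy releases that still ship SgmlLinkExtractor
# (an assumption; adjust them to the version you have installed).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):

    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # `allow` comes from `-a allow=...`. Use the argument itself:
        # self.allow is never assigned, so reading it would fail.
        # An empty tuple (argument omitted) matches every link.
        # Set rules before the parent __init__, which compiles them.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=(allow,) if allow else ())),
        )
        super(MySpider, self).__init__(*args, **kwargs)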

Then you can pass the allow argument on the command line like this:

scrapy crawl example.com -a allow="item\.php"
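Since values passed with -a always reach __init__ as strings (or stay None when the argument is omitted), it may help to normalize the value before handing it to the link extractor. Below is a minimal sketch of one way to do that. The helper name parse_allow_arg and the comma-separated format are assumptions made for illustration; neither is part of Scrapy. Splitting on commas also means a pattern that itself contains a comma (such as a{1,2}) would not survive, so treat this as a starting point only.

```python
import re


def parse_allow_arg(raw, sep=","):
    """Turn the string given via `-a allow=...` into a tuple of regex patterns.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    if not raw:
        # Argument omitted or empty: an empty tuple places no restriction
        # on which links the extractor accepts.
        return ()
    # Split on the separator and drop surrounding whitespace and empty pieces.
    patterns = tuple(p.strip() for p in raw.split(sep) if p.strip())
    for pattern in patterns:
        # Fail early on a malformed pattern instead of midway through a crawl.
        re.compile(pattern)
    return patterns


if __name__ == "__main__":
    print(parse_allow_arg(r"item\.php, category\.php"))
```

Inside __init__ the result would be passed as the allow value, for example allow=parse_allow_arg(allow), and the command line could then carry several patterns at once: scrapy crawl example.com -a allow="item\.php,category\.php". On current Scrapy releases, where SgmlLinkExtractor is no longer available, scrapy.linkextractors.LinkExtractor accepts an allow argument in the same way.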

Answered 2025-04-18 by Python大师
