How to use different pipelines for different spiders in a single Scrapy project

102 votes

11 answers

34114 views

数据工程师

Asked 2025-04-17 07:38

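I have a Scrapy project that contains multiple spiders. Is there a way to specify which item pipelines each spider should use? Not every pipeline I have defined applies to every spider.

Thanks!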

project configuration, scrapy, spider management, item pipelines

11 answers

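The other solutions here are good, but I think they could be slow: instead of enabling pipelines per spider, they check on every returned item whether the pipeline applies, and in some crawls that check can run millions of times.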

A good way to completely disable (or enable) a feature per spider is to combine custom_settings with from_crawler. This works for all extensions:

pipelines.py

from scrapy.exceptions import NotConfigured

class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
   'myproject.pipelines.SomePipeline': 300,
}
SOMEPIPELINE_ENABLED = True # you could have the pipeline enabled by default

spider1.py

from scrapy import Spider


class Spider1(Spider):

    name = 'spider1'

    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

As you can see, we specified custom_settings, which overrides the values in settings.py, and we set SOMEPIPELINE_ENABLED to False for this spider, which disables the pipeline.

Now when you run this spider, look for a log line like:

[scrapy] INFO: Enabled item pipelines: []

Scrapy has now disabled the pipeline completely and does not consider it at any point during the run. The same approach also works for Scrapy extensions and middlewares.
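The same opt-out pattern can be sketched for a downloader middleware. This is a minimal illustration rather than code from the answer: SomeMiddleware and the SOMEMIDDLEWARE_ENABLED setting are hypothetical names, and the import fallback, FakeSettings, and build() exist only so the snippet can run without Scrapy installed.

```python
from types import SimpleNamespace

try:
    from scrapy.exceptions import NotConfigured
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed (assumption).
    class NotConfigured(Exception):
        pass


class SomeMiddleware:
    """Hypothetical downloader middleware that can be switched off per spider."""

    @classmethod
    def from_crawler(cls, crawler):
        # Raising NotConfigured here keeps the component out of the crawl.
        if not crawler.settings.getbool('SOMEMIDDLEWARE_ENABLED'):
            raise NotConfigured
        return cls()

    def process_request(self, request, spider):
        # Returning None lets the request continue through the chain.
        return None


class FakeSettings(dict):
    """Stand-in for Scrapy's settings object; only getbool() is modelled."""

    def getbool(self, name, default=False):
        return bool(self.get(name, default))


def build(enabled):
    """Mimic component loading: return the instance, or None if it opted out."""
    settings = FakeSettings(SOMEMIDDLEWARE_ENABLED=enabled)
    crawler = SimpleNamespace(settings=settings)
    try:
        return SomeMiddleware.from_crawler(crawler)
    except NotConfigured:
        return None
```

In a real project the middleware would be listed in DOWNLOADER_MIDDLEWARES, and a spider would set 'SOMEMIDDLEWARE_ENABLED': False in its custom_settings, mirroring the pipeline example.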

Answered 2025-04-17 by Python大师


178

Just remove all pipelines from the main settings and use this inside the spider.

This way each spider defines its own pipelines.

from scrapy.spiders.init import InitSpider


class testSpider(InitSpider):
    name = 'test'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
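When each spider carries its own ITEM_PIPELINES dict, the integer values still decide the order in which that spider's pipelines run: lower values run first, and the values are conventionally kept in the 0-1000 range. Below is a small sketch of that ordering. It uses plain dicts instead of real spiders, and the spider names and pipeline paths are hypothetical.

```python
# Hypothetical per-spider pipeline dicts, as they would appear in custom_settings.
SPIDER_PIPELINES = {
    'spider_a': {
        'app.pipelines.SavePipeline': 400,
        'app.pipelines.ValidatePipeline': 100,
    },
    'spider_b': {
        'app.pipelines.SavePipeline': 400,
    },
}


def run_order(item_pipelines):
    """Return pipeline paths sorted by their integer value, lowest first."""
    ordered = sorted(item_pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _ in ordered]


for name, item_pipelines in SPIDER_PIPELINES.items():
    print(name, run_order(item_pipelines))
```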

Answered 2025-04-17 by Python大师


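Building on Pablo Hoffman's solution, you can apply the following decorator to the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider to decide whether the method should run. For example: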

import functools
import logging


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=logging.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=logging.DEBUG)
            return item

    return wrapper

For this decorator to work, the spider must have a pipeline attribute containing the Pipeline classes you want to use to process the item, for example:

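import scrapy

from myproject import pipelines  # assumed package name; adjust to your project


class MySpider(scrapy.Spider):

    # the Pipeline classes that should process this spider's items
    pipeline = {
        pipelines.Save,
        pipelines.Validate,
    }

    def parse(self, response):
        # insert scrapy goodness here
        return item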

Then in a pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item

class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item
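The decorator and the two pipelines above can be exercised without Scrapy to see the skipping behaviour. This is a trimmed sketch: logging is left out, the pipeline bodies just mark the item, and FakeSpider is a hypothetical stand-in that lists only Save.

```python
import functools


def check_spider_pipeline(process_item_method):
    """Run the wrapped process_item only if the spider lists this class."""

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in spider.pipeline:
            return process_item_method(self, item, spider)
        return item  # not listed: pass the item through unchanged

    return wrapper


class Save:
    @check_spider_pipeline
    def process_item(self, item, spider):
        item['saved'] = True
        return item


class Validate:
    @check_spider_pipeline
    def process_item(self, item, spider):
        item['validated'] = True
        return item


class FakeSpider:
    """Stand-in for a spider; only the pipeline attribute matters here."""
    pipeline = {Save}


# Both steps are "enabled", but only Save touches the item for this spider.
item = {}
for step in (Validate(), Save()):
    item = step.process_item(item, FakeSpider())
print(item)  # {'saved': True}
```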

All Pipeline objects must still be defined in ITEM_PIPELINES in the settings (and in the correct order; it would be nice if the order could be specified on the spider as well).
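One possible way to let the spider decide the order is a single dispatching pipeline that reads an ordered list from the spider and delegates to each step in turn. This is only a sketch and not part of the answer: DispatchPipeline and the pipeline_steps attribute are hypothetical names, and FakeSpider stands in for a real spider.

```python
class Validate:
    def process_item(self, item, spider):
        item.setdefault('trace', []).append('validate')
        return item


class Save:
    def process_item(self, item, spider):
        item.setdefault('trace', []).append('save')
        return item


class DispatchPipeline:
    """Hypothetical single ITEM_PIPELINES entry that delegates to per-spider steps."""

    def __init__(self):
        self._instances = {}  # cache: one instance per step class

    def process_item(self, item, spider):
        # pipeline_steps is an assumed spider attribute: an ordered list of classes.
        for step_cls in getattr(spider, 'pipeline_steps', []):
            if step_cls not in self._instances:
                self._instances[step_cls] = step_cls()
            item = self._instances[step_cls].process_item(item, spider)
        return item


class FakeSpider:
    """Stand-in spider that wants Save to run before Validate."""
    pipeline_steps = [Save, Validate]


result = DispatchPipeline().process_item({}, FakeSpider())
print(result['trace'])  # ['save', 'validate']
```

With this design only the dispatcher would be listed in ITEM_PIPELINES. The trade-off is that hooks such as open_spider and close_spider on the individual steps are not called unless the dispatcher forwards them.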

Answered 2025-04-17 by Python大师
