Scrapy - Get a spider variable in a DOWNLOAD MIDDLEWARE's __init__

4 votes
3 answers
2091 views
Asked 2025-04-19 22:56

I'm working on a Scrapy project and wrote a downloader middleware to avoid sending requests to URLs that are already in the database.

DOWNLOADER_MIDDLEWARES = {
    'imobotS.utilities.RandomUserAgentMiddleware': 400,
    'imobotS.utilities.DuplicateFilterMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

My idea is to connect in the __init__ method and load a list of the URLs currently stored in the database, then raise IgnoreRequest if the requested URL is already in the database.

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        # Connect once and preload the known URLs for this site
        connection = pymongo.MongoClient('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None

So I'd like to restrict the URL list by WEBSITE_NAME at init time. Is there any way to identify the current spider name inside a downloader middleware's __init__ method?

3 Answers

0

Yes, you can access the spider's name in your middleware. The way to do it is to define a classmethod called from_crawler and connect the spider_opened signal to a spider_opened function. That way you can store the spider's name on your middleware class.

from scrapy import signals


class DuplicateFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.spider_name = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        # connect the middleware object to the spider_opened signal
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # return the middleware object
        return ext

    def spider_opened(self, spider):
        self.spider_name = spider.name

For more information about signals, see the signals documentation.
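
Note that spider_opened fires after __init__, so anything that depends on the spider name has to wait for the signal handler. Putting this together with the question's code, the per-site query would move into spider_opened. A minimal sketch under that assumption, reusing the question's MongoDB layout and placeholder credentials:

import pymongo
from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.url_set = set()
        # Connection details are placeholders copied from the question
        connection = pymongo.MongoClient('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        # The spider name is only known once the spider opens,
        # so the per-site query runs here rather than in __init__
        self.url_set = set(self.db.ad.find({'site': spider.name}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None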

2

Building on what @Ahsan Roy said above, you don't necessarily have to use the signals API (at least as of Scrapy 2.4.0):

Through the from_crawler method you can access the spider (including its name) as well as all the other crawler settings. You can use that to pass whatever arguments you want to your middleware class's constructor (i.e. __init__):

class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
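
Since crawler.settings is passed through as well, the connection details could come from the project settings instead of being hard-coded, and the URL set could be restricted to the spider's site right in __init__. A sketch of that variant, where MONGO_HOST and MONGO_PORT are hypothetical custom settings (not part of Scrapy):

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        # MONGO_HOST / MONGO_PORT would be defined in settings.py
        host = settings.get('MONGO_HOST', 'localhost')
        port = settings.getint('MONGO_PORT', 27017)
        self.db = pymongo.MongoClient(host, port)['my_db']
        # Restrict the preloaded URL set to this spider's site
        self.url_set = set(self.db.ad.find({'site': self.spider_name}).distinct('url'))

    def process_request(self, request, spider):
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None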

2

You can move the URL fetching into process_request, and check whether you've already fetched the URLs for that site.

import pymongo
from scrapy.exceptions import IgnoreRequest


class DuplicateFilterMiddleware(object):

    def __init__(self):
        connection = pymongo.MongoClient('localhost', 12345)
        self.db = connection['my_db']
        self.db.authenticate('scott', '*****')

        # Cache of per-site URL lists, fetched lazily on first request
        self.url_sets = {}

    def process_request(self, request, spider):
        if not self.url_sets.get(spider._site_name):
            self.url_sets[spider._site_name] = self.db.ad.find({'site': spider._site_name}).distinct('url')

        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        else:
            return None
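
One small refinement worth considering: distinct() returns a list, so the request.url check is a linear scan on every request. Wrapping each result in a set makes the membership test O(1). A sketch of the changed process_request, under the same assumptions as above:

    def process_request(self, request, spider):
        if spider._site_name not in self.url_sets:
            # set membership is O(1); the list distinct() returns is O(n)
            self.url_sets[spider._site_name] = set(
                self.db.ad.find({'site': spider._site_name}).distinct('url')
            )

        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None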
