Scrapy - getting spider variables inside a DOWNLOADER MIDDLEWARE's __init__
I'm working on a Scrapy project and wrote a downloader middleware to avoid sending requests for URLs that are already in the database.
DOWNLOADER_MIDDLEWARES = {
'imobotS.utilities.RandomUserAgentMiddleware': 400,
'imobotS.utilities.DupFilterMiddleware': 500,
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
The idea is to connect in __init__ and load the list of URLs currently stored in the database, and if a crawled URL is already there, raise IgnoreRequest to skip the request.
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    def __init__(self):
        # MongoClient replaces the long-deprecated pymongo.Connection
        connection = pymongo.MongoClient('localhost', 12345,
                                         username='scott', password='*****')
        self.db = connection['my_db']
        # Cache every URL already stored for this site
        self.url_set = self.db.ad.find({'site': 'WEBSITE_NAME'}).distinct('url')

    def process_request(self, request, spider):
        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None
So I'd like to restrict the URL list by WEBSITE_NAME at initialization time. Is there a way to find out the name of the current spider inside a downloader middleware's __init__ method?
3 Answers
0
Yes, you can access the spider's name in your middleware. The way to do it is to define a classmethod called from_crawler and connect the spider-opened signal to a spider_opened handler. That handler can then store the spider's name on your middleware class.
from scrapy import signals

class DuplicateFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.spider_name = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        # connect the middleware object to the spider_opened signal
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # return the middleware object
        return ext

    def spider_opened(self, spider):
        # the spider is available here, so its name can be stored
        self.spider_name = spider.name
To learn more about signals, see the signals documentation.
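Applied to the question's use case, a minimal sketch of this approach could look like the following (assuming the pymongo connection details, the ad collection, and the spider._site_name attribute from the question; the URL set is loaded lazily in spider_opened, once the spider is known):
import pymongo
from scrapy import signals
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    def __init__(self, crawler):
        self.crawler = crawler
        self.url_set = None  # filled in once the spider has opened

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        # The spider (and its _site_name) is known here, so the query can be
        # scoped to the current site before any request is processed.
        connection = pymongo.MongoClient('localhost', 12345,
                                         username='scott', password='*****')
        db = connection['my_db']
        self.url_set = set(db.ad.find({'site': spider._site_name}).distinct('url'))

    def process_request(self, request, spider):
        if self.url_set and request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None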
2
Following up on what @Ahsan Roy said above, you don't necessarily have to use the signals API (at least as of Scrapy 2.4.0): through from_crawler you can access the spider (including its name) as well as all the other spider settings. You can use that to pass whatever arguments you want to your middleware class's constructor (__init__):
class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        """This method is called by Scrapy and needs to return an instance of the middleware"""
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        self.spider_name = spider.name
        self.settings = settings

    def process_request(self, request, spider):
        print("spider {s} is processing stuff".format(s=self.spider_name))
        return None  # keep processing normally
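Because the settings object is handed in as well, the connection details hardcoded in the question could come out of the project settings instead. A sketch under that assumption; MONGO_HOST and MONGO_PORT are hypothetical custom settings (they would need to be defined in settings.py), while Settings.get() and Settings.getint() are standard Scrapy accessors:
import pymongo

class DuplicateFilterMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.spider, crawler.settings)

    def __init__(self, spider, settings):
        # MONGO_HOST / MONGO_PORT are hypothetical custom settings
        connection = pymongo.MongoClient(settings.get('MONGO_HOST', 'localhost'),
                                         settings.getint('MONGO_PORT', 12345))
        db = connection['my_db']
        # Scope the cached URLs to the spider that is about to run
        self.url_set = set(db.ad.find({'site': spider.name}).distinct('url'))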
2
You can move fetching the URLs into process_request, and check there whether you have already fetched them:
import pymongo
from scrapy.exceptions import IgnoreRequest

class DuplicateFilterMiddleware(object):

    def __init__(self):
        connection = pymongo.MongoClient('localhost', 12345,
                                         username='scott', password='*****')
        self.db = connection['my_db']
        self.url_sets = {}

    def process_request(self, request, spider):
        # Lazily load and cache the URL set the first time each spider is seen
        if not self.url_sets.get(spider._site_name):
            self.url_sets[spider._site_name] = set(
                self.db.ad.find({'site': spider._site_name}).distinct('url'))
        print("%s - process Request URL: %s" % (spider._site_name, request.url))
        if request.url in self.url_sets[spider._site_name]:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None
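Two details in this variant are worth noting: the per-site cache means the database is queried only once per spider, and wrapping the result in set() (distinct() returns a plain list) turns the request.url membership test into an O(1) lookup instead of a linear scan.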