scrapy-redis
Redis-based components for Scrapy.
- Free software: MIT license
- Documentation: https://scrapy-redis.readthedocs.org
- Python versions: 2.7, 3.4+
Features

Distributed crawling/scraping
    You can start multiple spider instances that share a single redis queue. Best suitable for broad multi-domain crawls.

Distributed post-processing
    Scraped items get pushed into a redis queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue.

Scrapy plug-and-play components
    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
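The distributed-crawl model above can be sketched with an in-process queue standing in for Redis. This is purely an illustration (scrapy-redis keeps the real queue in a Redis list, so instances can run on different hosts): two spider "instances" pop from one shared queue, and each request is handled exactly once.

```python
from collections import deque
from itertools import cycle

# Stand-in for the single shared Redis requests queue (illustration only).
shared_queue = deque('http://example.com/page/%d' % i for i in range(6))

fetched = {'spider-1': [], 'spider-2': []}

# Two spider "instances" take turns popping from the same queue: each
# request is handled by exactly one instance, never both.
for name in cycle(fetched):
    if not shared_queue:
        break
    fetched[name].append(shared_queue.popleft())

# Every URL was crawled exactly once across all instances.
assert sum(len(urls) for urls in fetched.values()) == 6
assert not set(fetched['spider-1']) & set(fetched['spider-2'])
```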
Requirements

- Python 2.7, 3.4 or 3.5
- Redis >= 2.8
- Scrapy >= 1.0
- redis-py >= 2.10
Usage

Use the following settings in your project:
    # Enables scheduling storing requests queue in redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Ensure all spiders share same duplicates filter through redis.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Default requests serializer is pickle, but it can be changed to any module
    # with loads and dumps functions. Note that pickle is not compatible between
    # python versions.
    # Caveat: In python 3.x, the serializer must return strings keys and support
    # bytes as values. Because of this reason the json or msgpack module will not
    # work by default. In python 2.x there is no such issue and you can use
    # 'json' or 'msgpack' as serializers.
    #SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

    # Don't cleanup redis queues, allows to pause/resume crawls.
    #SCHEDULER_PERSIST = True

    # Schedule requests using a priority queue. (default)
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

    # Alternative queues.
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    #SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

    # Max idle time to prevent the spider from being closed when distributed crawling.
    # This only works if queue class is SpiderQueue or SpiderStack,
    # and may also block the same time when your spider start at the first time
    # (because the queue is empty).
    #SCHEDULER_IDLE_BEFORE_CLOSE = 10

    # Store scraped item in redis for post-processing.
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300
    }

    # The item pipeline serializes and stores the items in this redis key.
    #REDIS_ITEMS_KEY = '%(spider)s:items'

    # The items serializer is by default ScrapyJSONEncoder. You can use any
    # importable path to a callable object.
    #REDIS_ITEMS_SERIALIZER = 'json.dumps'

    # Specify the host and port to use when connecting to Redis (optional).
    #REDIS_HOST = 'localhost'
    #REDIS_PORT = 6379

    # Specify the full Redis URL for connecting (optional).
    # If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
    #REDIS_URL = 'redis://user:pass@hostname:9001'

    # Custom redis client parameters (i.e.: socket timeout, etc.)
    #REDIS_PARAMS = {}
    # Use custom redis client class.
    #REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

    # If True, it uses redis' ``spop`` operation. This could be useful if you
    # want to avoid duplicates in your start urls list. In this case, urls must
    # be added via ``sadd`` command or you will get a type error from redis.
    #REDIS_START_URLS_AS_SET = False

    # Default start urls key for RedisSpider and RedisCrawlSpider.
    #REDIS_START_URLS_KEY = '%(name)s:start_urls'

    # Use other encoding than utf-8 for redis.
    #REDIS_ENCODING = 'latin1'
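As the settings above note, SCHEDULER_SERIALIZER only needs to point at something exposing loads and dumps functions. A minimal sketch of such a serializer module, using the stdlib pickle module (the protocol detail mirrors what scrapy_redis.picklecompat does, but treat that as an assumption rather than a spec):

```python
import pickle

# Sketch of a serializer usable as SCHEDULER_SERIALIZER: any importable
# module (or object) with ``loads`` and ``dumps`` callables works.
def dumps(obj):
    # protocol=-1 selects the highest protocol this interpreter supports;
    # note the caveat above: pickle is not compatible across Python versions.
    return pickle.dumps(obj, protocol=-1)

def loads(data):
    return pickle.loads(data)

# Round-trip a request-like dict through the serializer.
request = {'url': 'http://example.com', 'method': 'GET', 'priority': 0}
assert loads(dumps(request)) == request
```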
Note: Version 0.3 changed the requests serialization from marshal to cPickle, therefore persisted requests made with version 0.2 will not work on 0.3.
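The incompatibility can be seen directly with the stdlib: bytes written by marshal are not readable by pickle, which is why queue entries persisted under 0.2 cannot be deserialized by 0.3.

```python
import marshal
import pickle

request = {'url': 'http://example.com', 'method': 'GET'}

# Version 0.2 persisted queue entries with marshal...
old_blob = marshal.dumps(request)

# ...but a pickle-based serializer (as used since 0.3) cannot read them back.
try:
    pickle.loads(old_blob)
    recovered = True
except Exception:
    recovered = False

assert recovered is False
```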