Redis-based components for Scrapy.



Scrapy-Redis



  • Free software: MIT license
  • Documentation: https://scrapy-redis.readthedocs.org
  • Python versions: 2.7, 3.4+

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a redis queue, meaning that you can start as many post-processing processes as needed, sharing the items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
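The distributed post-processing flow above can be sketched with the standard library alone. This is an illustrative sketch, not scrapy-redis code: a plain deque stands in for the redis list that the item pipeline pushes to, and a real worker would pop from redis instead.

```python
import json
from collections import deque

# A plain deque stands in for the redis items list (key
# "%(spider)s:items" by default); a real worker would pop from a
# redis client instead of this in-process queue.
items_queue = deque([
    json.dumps({"url": "http://example.com/1", "title": "Page 1"}),
    json.dumps({"url": "http://example.com/2", "title": "Page 2"}),
])

def process_items(queue):
    """Drain serialized items and decode them, as one of several
    post-processing workers sharing the queue would."""
    results = []
    while queue:
        raw = queue.popleft()            # a blocking pop in a real worker
        results.append(json.loads(raw))  # items are JSON-encoded by default
    return results

processed = process_items(items_queue)
print([item["title"] for item in processed])  # ['Page 1', 'Page 2']
```

Because the queue lives in redis rather than in the spider process, any number of such workers can run on other machines and share the load.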

Requirements

  • Python 2.7, 3.4 or 3.5
  • Redis >= 2.8
  • Scrapy >= 1.0
  • redis-py >= 2.10

Usage

Use the following settings in your project:

```python
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return string keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider starts for the first time
# (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this case, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
```
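The SCHEDULER_SERIALIZER caveat can be made concrete with a stdlib-only sketch. Everything here is hypothetical and not part of scrapy-redis (the module name `jsoncompat` and helper `_default` are invented for illustration): under Python 3 the scheduler hands the serializer dicts that may contain bytes values, which the stock json module rejects, so a json-based serializer would have to decode them first.

```python
import json

# Hypothetical wrapper module (e.g. myproject/jsoncompat.py) that could be
# referenced as SCHEDULER_SERIALIZER = 'myproject.jsoncompat'. It exposes
# the loads/dumps pair the scheduler expects, decoding bytes values so the
# stock json encoder does not choke on them under Python 3.

def _default(obj):
    # json.dumps calls this for objects it cannot encode natively; decode
    # bytes with latin1 so every possible byte maps to one character.
    if isinstance(obj, bytes):
        return obj.decode("latin1")
    raise TypeError("Unserializable object: %r" % (obj,))

def dumps(obj):
    return json.dumps(obj, default=_default)

def loads(data):
    return json.loads(data)

# A simplified request dict with string keys and a bytes value.
request_dict = {"url": "http://example.com", "body": b"", "priority": 0}
restored = loads(dumps(request_dict))
print(restored["url"])  # http://example.com
```

Note that bytes values come back as str after the round trip, which is one reason pickle remains the default serializer.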
Note:

Version 0.3 changed the requests serialization from marshal to cPickle; therefore, requests persisted with version 0.2 will not work on 0.3.
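The incompatibility can be demonstrated directly with the standard library (a minimal sketch, not scrapy-redis code): bytes written by marshal are not valid pickle data, so a queue entry persisted by 0.2 cannot be read back by 0.3.

```python
import marshal
import pickle

# Roughly how a 0.2 scheduler would have persisted a (simplified)
# request dict...
legacy_blob = marshal.dumps({"url": "http://example.com", "meta": {"depth": 1}})

# ...and what happens when 0.3 tries to read it back with pickle.
try:
    pickle.loads(legacy_blob)
    compatible = True
except Exception:
    compatible = False

print(compatible)  # False: marshal bytes are not valid pickle data
```

The practical fix is simply to let the old queues drain (or flush them) before upgrading.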
