A proxy-rotating middleware for Scrapy
Detailed description of the scrapy-rotated-proxy Python project
Overview
scrapy-rotated-proxy is a Scrapy downloader middleware that dynamically attaches a proxy to each request, rotating through the proxies supplied in its configuration. It can temporarily block an unavailable proxy IP and bring it back for future use once it becomes available again. It can also remove invalid proxy IPs via Scrapy signals. The proxy IP list can be provided through Spider settings, a local file, or MongoDB.
Requirements
- Python 2.7 or Python 3.3+
- Works on Linux, Windows, macOS, BSD
Installation
The quick way:
pip install scrapy-rotated-proxy
Alternatively, copy this middleware into your Scrapy project.
Configuration
Basic Configuration
Enable via Spider Settings
Enable RotatedProxyMiddleware and provide the proxy IP list through Spider settings:
```python
# -----------------------------------------------------------------------------
# ROTATED PROXY SETTINGS (Spider Settings Backend)
# -----------------------------------------------------------------------------
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_rotated_proxy.downloadmiddlewares.proxy.RotatedProxyMiddleware': 750,
})
ROTATED_PROXY_ENABLED = True
PROXY_STORAGE = 'scrapy_rotated_proxy.extensions.file_storage.FileProxyStorage'
# When PROXY_FILE_PATH is set to '', scrapy-rotated-proxy
# defaults to the proxies defined in Spider settings.
PROXY_FILE_PATH = ''
HTTP_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888',
]
HTTPS_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888',
]
```
Enable via a Local File
Enable RotatedProxyMiddleware and provide the proxy IP list through a local file:
```python
# -----------------------------------------------------------------------------
# ROTATED PROXY SETTINGS (Local File Backend)
# -----------------------------------------------------------------------------
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_rotated_proxy.downloadmiddlewares.proxy.RotatedProxyMiddleware': 750,
})
ROTATED_PROXY_ENABLED = True
PROXY_STORAGE = 'scrapy_rotated_proxy.extensions.file_storage.FileProxyStorage'
PROXY_FILE_PATH = 'file_path/proxy.txt'
```
A local file storing the proxy list in JSON style:
```python
# proxy file content; the values must conform to JSON format,
# otherwise a JSON load error will be raised
HTTP_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888'
]
HTTPS_PROXIES = [
    'http://proxy0:8888',
    'http://user:pass@proxy1:8888',
    'https://user:pass@proxy1:8888'
]
```
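As the comment in the file notes, the list values must be parseable as JSON. A quick illustrative check (not part of the library) of what "json load error" means in practice, assuming double-quoted strings in the stored lists:

```python
import json

# A double-quoted list is valid JSON and loads cleanly.
ok = json.loads('["http://proxy0:8888", "http://user:pass@proxy1:8888"]')
print(ok[0])  # http://proxy0:8888

# Single-quoted strings are not valid JSON: json.loads raises an error,
# which is the load error the file comment warns about.
try:
    json.loads("['http://proxy0:8888']")
except ValueError as err:
    print("invalid JSON:", err)
```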
Enable via MongoDB
Enable RotatedProxyMiddleware and provide the proxy IP list through MongoDB:
```python
# -----------------------------------------------------------------------------
# ROTATED PROXY SETTINGS (MongoDB Backend)
# -----------------------------------------------------------------------------
# mongodb document required fields: scheme, ip, port, username, password
# document example: {'scheme': 'http', 'ip': '10.0.0.1', 'port': 8080,
#                    'username': 'user', 'password': 'password'}
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_rotated_proxy.downloadmiddlewares.proxy.RotatedProxyMiddleware': 750,
})
ROTATED_PROXY_ENABLED = True
PROXY_STORAGE = 'scrapy_rotated_proxy.extensions.mongodb_storage.MongoDBProxyStorage'
PROXY_MONGODB_HOST = HOST_OR_IP
PROXY_MONGODB_PORT = 27017
PROXY_MONGODB_USERNAME = USERNAME_OR_NONE
PROXY_MONGODB_PASSWORD = PASSWORD_OR_NONE
PROXY_MONGODB_AUTH_DB = 'admin'
PROXY_MONGODB_DB = 'vps_management'
PROXY_MONGODB_COLL = 'service'
```
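Each document in the collection carries the fields listed in the comment above. As an illustrative sketch of how those fields correspond to a proxy URL (the `to_proxy_url` helper below is hypothetical, not part of the library):

```python
# Document shape the MongoDB backend expects, per the comment above.
doc = {'scheme': 'http', 'ip': '10.0.0.1', 'port': 8080,
       'username': 'user', 'password': 'password'}

def to_proxy_url(d):
    """Hypothetical helper: compose a proxy URL from a stored document."""
    auth = '{}:{}@'.format(d['username'], d['password']) if d.get('username') else ''
    return '{}://{}{}:{}'.format(d['scheme'], auth, d['ip'], d['port'])

print(to_proxy_url(doc))  # http://user:password@10.0.0.1:8080
```

With pymongo, such a document would be inserted into the database and collection named by PROXY_MONGODB_DB and PROXY_MONGODB_COLL, e.g. with `collection.insert_one(doc)`.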
Advanced Configuration
Block Settings
By default, the spider closes once all proxies have been used up. You can instead configure the spider to wait until blocked proxies become valid again, using the block signal:

```python
# -----------------------------------------------------------------------------
# OTHER SETTINGS (Optional)
# -----------------------------------------------------------------------------
PROXY_SLEEP_INTERVAL = 60*60*24           # Default: 24 hours
PROXY_SPIDER_CLOSE_WHEN_NO_PROXY = False  # Default: True
```
Signals
To remove a proxy so it is never used again in the spider, send the scrapy_rotated_proxy.signals.proxy_remove signal, which must carry the arguments spider, request, and exception.
To block a proxy so it becomes usable again only after the sleep interval has elapsed, send the scrapy_rotated_proxy.signals.proxy_block signal, which must carry the arguments spider, response, and exception.
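As an illustrative sketch of sending one of these signals from spider code: the `block_proxy` helper below is hypothetical (not part of the library), `send_catch_log` is Scrapy's standard SignalManager method, and the `signal` argument is expected to be `scrapy_rotated_proxy.signals.proxy_block`.

```python
def block_proxy(spider, response, signal, exception=None):
    """Hypothetical helper: ask the middleware to block the proxy that
    produced `response`, e.g. after detecting a ban page.

    `signal` should be scrapy_rotated_proxy.signals.proxy_block."""
    # send_catch_log dispatches the signal to all connected receivers,
    # logging (rather than propagating) any receiver errors.
    return spider.crawler.signals.send_catch_log(
        signal=signal, spider=spider, response=response, exception=exception)
```

A typical call site would be inside a spider callback, e.g. `block_proxy(self, response, rp_signals.proxy_block)` after recognizing a ban page in `response`.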
Settings Reference
| Setting | Description | Default |
| --- | --- | --- |
| ROTATED_PROXY_ENABLED | Whether to enable Scrapy-Rotated-Proxy | True |
| PROXY_STORAGE | Class that implements the proxy storage backend | FileProxyStorage |
| PROXY_MONGODB_HOST | MongoDB host for the MongoDB backend | '127.0.0.1' |
| PROXY_MONGODB_PORT | MongoDB port for the MongoDB backend | 27017 |
| PROXY_MONGODB_USERNAME | MongoDB username for the MongoDB backend | None |
| PROXY_MONGODB_PASSWORD | MongoDB password for the MongoDB backend | None |
| PROXY_MONGODB_DB | MongoDB database name for the MongoDB backend | proxy_management |
| PROXY_MONGODB_COLL | MongoDB collection name for the MongoDB backend | proxy |
| PROXY_MONGODB_OPTIONS_* | MongoDB URI options for the MongoDB backend | |
| PROXY_FILE_PATH | Path of the file that stores proxies; None means proxies come from Spider settings | None |
| HTTP_PROXIES | Keyword for HTTP proxies in the LocalFile backend or Spider settings | |
| HTTPS_PROXIES | Keyword for HTTPS proxies in the LocalFile backend or Spider settings | |
| PROXY_SLEEP_INTERVAL | Time to sleep before a blocked proxy becomes available again | 60*60*24 |
| PROXY_SPIDER_CLOSE_WHEN_NO_PROXY | Whether to close the spider when all proxies are exhausted | True |
| PROXY_RELOAD_ENABLED | Whether to reload proxies from storage once all have been used, before cycling through them again | False |