A pure-Python robots.txt parser with support for modern conventions
protego
Overview
Protego is a pure-Python robots.txt parser with support for modern conventions.
Requirements
- Python 2.7 or Python 3.5+
- Works on Linux, Windows, Mac OS X, BSD
Installation
To install Protego, simply use pip:
pip install protego
Usage
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m       # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
Using Protego with Requests
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
Documentation
Class protego.Protego:
Attributes
- sitemaps {list_iterator} A list of the sitemaps specified in robots.txt.
- preferred_host {string} The preferred host specified in robots.txt.
Methods
- parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.
- can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.
- crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
- request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.