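A simple and clear web crawler framework built on Python 3.6+ and asyncio
Detailed description of the ant_nest Python project
Overview
ant_nest is a simple, clear and fast web crawler framework built on Python 3.6+ and powered by asyncio. Its core is currently only a little over 600 lines of code (thanks to powerful libraries such as aiohttp and lxml).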
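Features
- A useful out-of-the-box HTTP client.
- Things (requests, responses and items) can pass through pipelines (async or not).
- Item extractor: makes it easy to define an item we want and extract it from HTML, JSON or strings (by XPath, JPath or regex).
- Custom "ensure_future" and "as_completed" APIs provide a simple workflow.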
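The pipeline idea in the list above can be sketched with the standard library alone. This is a minimal illustration, not ant_nest's implementation: the names `strip_whitespace`, `drop_untitled` and `run_pipelines` are hypothetical, and the convention that returning `None` drops a thing is an assumption made for this sketch. It uses `asyncio.run`, which needs Python 3.7+.

```python
import asyncio
import inspect
from typing import Any, Callable, Iterable, Optional


def strip_whitespace(item: dict) -> dict:
    """Synchronous pipeline (hypothetical): trim every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


async def drop_untitled(item: dict) -> Optional[dict]:
    """Asynchronous pipeline (hypothetical): returning None drops the item."""
    await asyncio.sleep(0)  # stand-in for real async work, e.g. a database lookup
    return item if item.get("title") else None


async def run_pipelines(thing: Any, pipelines: Iterable[Callable]) -> Any:
    """Pass `thing` through each pipeline in order; stop early if it is dropped."""
    for pipeline in pipelines:
        result = pipeline(thing)
        if inspect.isawaitable(result):  # support both sync and async pipelines
            result = await result
        if result is None:  # assumption of this sketch: None means "dropped"
            return None
        thing = result
    return thing


async def main() -> list:
    items = [{"title": "  ant_nest \n"}, {"title": ""}]
    pipelines = [strip_whitespace, drop_untitled]
    return [await run_pipelines(item, pipelines) for item in items]


results = asyncio.run(main())
print(results)
```

Checking `inspect.isawaitable` on the return value lets one list mix plain functions and coroutines, which is one simple way to get the "async or not" behavior described above.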
Install
pip install ant_nest
Usage
Create a demo project with the CLI:
>>> ant_nest -c examples
Then we have a project:
drwxr-xr-x 5 bruce staff 160 Jun 30 18:24 ants
-rw-r--r-- 1 bruce staff 208 Jun 26 22:59 settings.py
Suppose we want to get the trending repositories from GitHub. Let's create "examples/ants/example2.py":
from yarl import URL

from ant_nest.ant import Ant
from ant_nest.pipelines import ItemFieldReplacePipeline
from ant_nest.things import ItemExtractor


class GithubAnt(Ant):
    """Crawl trending repositories from github"""

    item_pipelines = [
        ItemFieldReplacePipeline(
            ("meta_content", "star", "fork"), excess_chars=("\r", "\n", "\t", " ")
        )
    ]
    concurrent_limit = 1  # save the website's and your bandwidth!

    def __init__(self):
        super().__init__()
        self.item_extractor = ItemExtractor(dict)
        self.item_extractor.add_extractor(
            "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "meta_content",
            lambda x: "".join(
                x.html_element.xpath(
                    '//div[@class="repository-content "]/div[2]//text()'
                )
            ),
        )
        self.item_extractor.add_extractor(
            "star",
            lambda x: x.html_element.xpath(
                '//a[@class="social-count js-social-count"]/text()'
            )[0],
        )
        self.item_extractor.add_extractor(
            "fork",
            lambda x: x.html_element.xpath('//a[@class="social-count"]/text()')[0],
        )
        self.item_extractor.add_extractor("origin_url", lambda x: str(x.url))

    async def crawl_repo(self, url):
        """Crawl information from one repo"""
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item["origin_url"] = response.url

        await self.collect(item)  # let item go through pipelines(be cleaned)
        self.logger.info("*" * 70 + "I got one hot repo!\n" + str(item))

    async def run(self):
        """App entrance, our play ground"""
        response = await self.request("https://github.com/explore")
        for url in response.html_element.xpath(
            "/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/"
            "@href"
        ):
            # crawl many repos with our coroutines pool
            self.schedule_task(self.crawl_repo(response.url.join(URL(url))))
        self.logger.info("Waiting...")
Then we can list all the ants we defined (in "examples"):
>>> ant_nest -l
ants.example2.GithubAnt
Run it! (without debug log):
>>> ant_nest -a ants.example2.GithubAnt
INFO:GithubAnt:Opening
INFO:GithubAnt:Waiting...
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi…', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': '? A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': '?\u200d? Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J…', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
INFO:GithubAnt:Closed
INFO:GithubAnt:Get 7 Request in total
INFO:GithubAnt:Get 7 Response in total
INFO:GithubAnt:Get 6 dict in total
INFO:GithubAnt:Run GithubAnt in 18.157656 seconds
So it is easy to configure an ant through class attributes:
class Ant(abc.ABC):
    response_pipelines: typing.List[Pipeline] = []
    request_pipelines: typing.List[Pipeline] = []
    item_pipelines: typing.List[Pipeline] = []
    request_cls = Request
    response_cls = Response
    request_timeout = 60
    request_retries = 3
    request_retry_delay = 5
    request_proxies: typing.List[typing.Union[str, URL]] = []
    request_max_redirects = 10
    request_allow_redirects = True
    response_in_stream = False
    connection_limit = 10  # see "TCPConnector" in "aiohttp"
    connection_limit_per_host = 0
    concurrent_limit = 100
You can also override some of these settings for a single request:
async def request(
    self,
    url: typing.Union[str, URL],
    method: str = aiohttp.hdrs.METH_GET,
    params: typing.Optional[dict] = None,
    headers: typing.Optional[dict] = None,
    cookies: typing.Optional[dict] = None,
    data: typing.Optional[
        typing.Union[typing.AnyStr, typing.Dict, typing.IO]
    ] = None,
    proxy: typing.Optional[typing.Union[str, URL]] = None,
    timeout: typing.Optional[float] = None,
    retries: typing.Optional[int] = None,
    response_in_stream: typing.Optional[bool] = None,
) -> Response:
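The `None` defaults in the signature above suggest a "fall back to the class attribute unless a value is passed" pattern. The sketch below shows that pattern in isolation. It is an assumption about the intent, not ant_nest's actual code: `SketchAnt`, `PatientAnt` and `resolve_request_options` are hypothetical names, and only the two attribute names are taken from the class shown earlier.

```python
from typing import Optional


class SketchAnt:
    # Class-level defaults, mirroring two of the attributes listed above.
    request_timeout = 60
    request_retries = 3

    def resolve_request_options(
        self, timeout: Optional[float] = None, retries: Optional[int] = None
    ) -> dict:
        """Return per-request options, using class defaults where None is passed."""
        return {
            # Compare against None so that explicit values like 0 are respected.
            "timeout": self.request_timeout if timeout is None else timeout,
            "retries": self.request_retries if retries is None else retries,
        }


class PatientAnt(SketchAnt):
    request_timeout = 120  # override for every request made by this subclass


defaults = PatientAnt().resolve_request_options()
one_off = PatientAnt().resolve_request_options(timeout=5, retries=0)
print(defaults)
print(one_off)
```

Testing with `is None` rather than truthiness matters here: `retries=0` is a meaningful request ("do not retry") and should not silently fall back to the class default.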
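About items
We use dict to store an item in the examples, but several ways of defining an item are actually supported: dict, normal class, attrs class, data class and ORM class. Which one to use depends on your needs and preference.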
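To make the difference between item styles concrete, here is a small standard-library sketch that fills a dict item and a data class item through one code path. It is illustrative only: `Repo`, `set_value` and `fill` are hypothetical names, not part of ant_nest, and the sample values come from the log output shown earlier. Data classes require Python 3.7+.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Repo:
    """A data class item with the same fields we would otherwise keep in a dict."""

    title: Optional[str] = None
    star: Optional[str] = None


def set_value(item: Any, key: str, value: Any) -> None:
    """Store a field on a mapping-style item or on an attribute-style item."""
    if isinstance(item, dict):
        item[key] = value
    else:
        setattr(item, key, value)


def fill(item_cls: type, values: dict) -> Any:
    """Create an empty item of `item_cls` and copy `values` into it."""
    item = item_cls()
    for key, value in values.items():
        set_value(item, key, value)
    return item


values = {"title": "mkcert", "star": "2,311"}
as_dict = fill(dict, values)
as_dataclass = fill(Repo, values)
print(as_dict)
print(as_dataclass)
```

A dict is the quickest option for a demo, while a data class (or attrs class) gives named fields that type checkers such as mypy can verify.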
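Examples
You can find some examples in "../examples".
Defects
- Complex exception handling
An exception in one coroutine breaks the await chain, especially inside a loop, unless we handle it manually. For example: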
for cor in self.as_completed((self.crawl(url) for url in self.urls)):
    try:
        await cor
    except Exception:  # may raise many exception in a await chain
        pass
But now we can use "self.as_completed_with_async", for example:
async for result in self.as_completed_with_async(
    (self.crawl(url) for url in self.urls), raise_exception=False
):
    # exception in "self.crawl(url)" will be passed and logged automatic
    self.handle(result)
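- High memory usage
It is a "feature" of asyncio that it uses a lot of memory, especially with highly concurrent IO. Setting a concurrency limit ("connection_limit" or "concurrent_limit") is simple, but finding the right balance between performance and the limit is complicated.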
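Both defects above can be reproduced and mitigated with plain asyncio. The sketch below is a rough illustration under stated assumptions, not ant_nest's implementation: `crawl` and `crawl_all` are hypothetical stand-ins, the concurrency cap is an `asyncio.Semaphore`, and failures are caught per task so that one bad URL does not break the loop. It uses `asyncio.run`, which needs Python 3.7+.

```python
import asyncio


async def crawl(url: str) -> str:
    """Stand-in for a real request; fails for one URL to show error handling."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return url.upper()


async def crawl_all(urls, concurrent_limit: int = 2):
    semaphore = asyncio.Semaphore(concurrent_limit)
    running = 0  # coroutines currently inside the semaphore
    peak = 0  # highest value `running` ever reached

    async def limited(url: str) -> str:
        nonlocal running, peak
        async with semaphore:  # at most `concurrent_limit` crawls at once
            running += 1
            peak = max(peak, running)
            try:
                return await crawl(url)
            finally:
                running -= 1

    results, errors = [], []
    for future in asyncio.as_completed([limited(url) for url in urls]):
        try:
            results.append(await future)
        except Exception as exc:  # handle each failure so the loop keeps going
            errors.append(str(exc))
    return sorted(results), errors, peak


results, errors, peak = asyncio.run(crawl_all(["a", "bad", "c", "d"]))
print(results, errors, peak)
```

A lower limit keeps fewer responses in memory at the same time but lengthens the total run, which is the balance mentioned above.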
Coding style
Follow "flake8", format with "black", type check with "mypy"; see the Makefile for more details.
Todo
[*] Log system
[*] Nested item extractor
[ ] Docs