Python ant_nest package - PyPI

A simple and clear web crawler framework based on Python 3.6+ and asyncio

ant_nest project description


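Overview

AntNest is a simple, clear and fast web crawler framework built on Python 3.6+ and powered by asyncio. Its core is currently only about 600 lines of code (thanks to powerful libraries such as aiohttp and lxml).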

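Features

A useful HTTP client out of the box
Things (request, response and item) can go through pipelines (async or not).
Item extractor: it is easy to define and extract (by xpath, jpath or regex) the item we want from HTML, JSON or strings.
Custom "ensure_future" and "as_completed" APIs provide an easy workflow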
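
The pipeline idea in the list above can be sketched with plain asyncio. This is only an illustration, not ant_nest's own Pipeline class: the names StripPipeline, DropEmptyPipeline and run_pipelines are hypothetical, and the convention that returning None drops the thing is an assumption made for this sketch.

```python
import asyncio
import inspect


class StripPipeline:
    """Hypothetical sync pipeline: trims whitespace from string fields."""

    def process(self, item):
        return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


class DropEmptyPipeline:
    """Hypothetical async pipeline: drops items that have no title."""

    async def process(self, item):
        await asyncio.sleep(0)  # stand-in for real async work (e.g. a DB lookup)
        return item if item.get("title") else None


async def run_pipelines(item, pipelines):
    """Pass one thing through each pipeline in order; None means 'dropped'."""
    for pipeline in pipelines:
        result = pipeline.process(item)
        if inspect.isawaitable(result):  # support both sync and async pipelines
            result = await result
        if result is None:
            return None
        item = result
    return item


async def main():
    pipelines = [StripPipeline(), DropEmptyPipeline()]
    kept = await run_pipelines({"title": "  ant_nest \n"}, pipelines)
    dropped = await run_pipelines({"title": "   "}, pipelines)
    return kept, dropped


kept, dropped = asyncio.run(main())
print(kept, dropped)
```

Checking the return value with inspect.isawaitable is one simple way to let sync and async pipelines share a single list.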

Install

pip install ant_nest

Usage

Create a demo project with the CLI:

>>> ant_nest -c examples

Then we have a project:

drwxr-xr-x   5 bruce  staff  160 Jun 30 18:24 ants
-rw-r--r--   1 bruce  staff  208 Jun 26 22:59 settings.py

Suppose we want to get trending repositories from GitHub; let's create "examples/ants/example2.py":

from yarl import URL
from ant_nest.ant import Ant
from ant_nest.pipelines import ItemFieldReplacePipeline
from ant_nest.things import ItemExtractor


class GithubAnt(Ant):
    """Crawl trending repositories from github"""

    item_pipelines = [
        ItemFieldReplacePipeline(
            ("meta_content", "star", "fork"), excess_chars=("\r", "\n", "\t", "  ")
        )
    ]
    concurrent_limit = 1  # save the website's bandwidth and your own!

    def __init__(self):
        super().__init__()
        self.item_extractor = ItemExtractor(dict)
        self.item_extractor.add_extractor(
            "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "meta_content",
            lambda x: "".join(
                x.html_element.xpath(
                    '//div[@class="repository-content "]/div[2]//text()'
                )
            ),
        )
        self.item_extractor.add_extractor(
            "star",
            lambda x: x.html_element.xpath(
                '//a[@class="social-count js-social-count"]/text()'
            )[0],
        )
        self.item_extractor.add_extractor(
            "fork",
            lambda x: x.html_element.xpath('//a[@class="social-count"]/text()')[0],
        )
        self.item_extractor.add_extractor("origin_url", lambda x: str(x.url))

    async def crawl_repo(self, url):
        """Crawl information from one repo"""
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item["origin_url"] = response.url

        await self.collect(item)  # let item go through pipelines(be cleaned)
        self.logger.info("*" * 70 + "I got one hot repo!\n" + str(item))

    async def run(self):
        """App entrance, our play ground"""
        response = await self.request("https://github.com/explore")
        for url in response.html_element.xpath(
            "/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/"
            "@href"
        ):
            # crawl many repos with our coroutines pool
            self.schedule_task(self.crawl_repo(response.url.join(URL(url))))
        self.logger.info("Waiting...")

Then we can list all the ants we defined (in "examples"):

>>> ant_nest -l
ants.example2.GithubAnt

Run it! (without debug log):

>>> ant_nest -a ants.example2.GithubAnt
INFO:GithubAnt:Opening
INFO:GithubAnt:Waiting...
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi…', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': '? A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': '?\u200d? Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J…', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
INFO:GithubAnt:Closed
INFO:GithubAnt:Get 7 Request in total
INFO:GithubAnt:Get 7 Response in total
INFO:GithubAnt:Get 6 dict in total
INFO:GithubAnt:Run GithubAnt in 18.157656 seconds

So it is easy to configure an ant through class attributes:

class Ant(abc.ABC):
    response_pipelines: typing.List[Pipeline] = []
    request_pipelines: typing.List[Pipeline] = []
    item_pipelines: typing.List[Pipeline] = []
    request_cls = Request
    response_cls = Response
    request_timeout = 60
    request_retries = 3
    request_retry_delay = 5
    request_proxies: typing.List[typing.Union[str, URL]] = []
    request_max_redirects = 10
    request_allow_redirects = True
    response_in_stream = False
    connection_limit = 10  # see "TCPConnector" in "aiohttp"
    connection_limit_per_host = 0
    concurrent_limit = 100

You can also override some of these settings for a single request:

async def request(
    self,
    url: typing.Union[str, URL],
    method: str = aiohttp.hdrs.METH_GET,
    params: typing.Optional[dict] = None,
    headers: typing.Optional[dict] = None,
    cookies: typing.Optional[dict] = None,
    data: typing.Optional[
        typing.Union[typing.AnyStr, typing.Dict, typing.IO]
    ] = None,
    proxy: typing.Optional[typing.Union[str, URL]] = None,
    timeout: typing.Optional[float] = None,
    retries: typing.Optional[int] = None,
    response_in_stream: typing.Optional[bool] = None,
) -> Response:
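
The relationship between class-level defaults and per-request arguments can be sketched as follows. This is a minimal illustration under an assumption: that an argument left as None falls back to the matching class attribute. FakeAnt, SlowSiteAnt and resolve_request_options are hypothetical stand-ins, not part of ant_nest.

```python
import typing


class FakeAnt:
    """Stand-in class, not the real ant_nest Ant; it only mimics the fallback rule."""

    request_timeout = 60
    request_retries = 3

    def resolve_request_options(
        self,
        timeout: typing.Optional[float] = None,
        retries: typing.Optional[int] = None,
    ) -> dict:
        # None means "not given for this call", so use the class attribute.
        return {
            "timeout": self.request_timeout if timeout is None else timeout,
            "retries": self.request_retries if retries is None else retries,
        }


class SlowSiteAnt(FakeAnt):
    request_timeout = 120  # class-level default for every request of this ant


ant = SlowSiteAnt()
defaults = ant.resolve_request_options()
overridden = ant.resolve_request_options(timeout=5, retries=0)
print(defaults, overridden)
```

Comparing against None rather than testing truthiness keeps an explicit "retries=0" from being silently replaced by the default.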

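About items

We use dict to store an item in the examples, but several ways of defining an item are supported: dict, normal class, attrs class, data class and ORM class. Which one to use depends on your needs and choice.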
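
To show why the item type can stay a free choice, here is a small sketch in which the same extractor functions fill either a dict or a data class. The helper "extract" and the class "Repo" are hypothetical and only imitate the idea; they are not ant_nest's ItemExtractor.

```python
import dataclasses
import re


@dataclasses.dataclass
class Repo:
    title: str = ""
    star: str = ""


def extract(item_cls, extractors, source):
    """Hypothetical helper: build an item of item_cls from extractor callables."""
    item = item_cls()
    for field, func in extractors.items():
        value = func(source)
        if isinstance(item, dict):
            item[field] = value  # mapping-style items
        else:
            setattr(item, field, value)  # attribute-style items
    return item


extractors = {
    "title": lambda text: re.search(r"<h1>(.*?)</h1>", text).group(1),
    "star": lambda text: re.search(r"stars: ([\d,]+)", text).group(1),
}
page = "<h1>mkcert</h1> stars: 2,311"

as_dict = extract(dict, extractors, page)
as_dataclass = extract(Repo, extractors, page)
print(as_dict, as_dataclass)
```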

Examples

You can find some examples in "../examples".

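Defects

Complex exception handling

An exception raised in one coroutine breaks the await chain, especially inside a loop, unless we handle it by hand. For example: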

for cor in self.as_completed((self.crawl(url) for url in self.urls)):
    try:
        await cor
    except Exception:  # may raise many exception in a await chain
        pass

But now we can use "self.as_completed_with_async", for example:

async for result in self.as_completed_with_async(
    (self.crawl(url) for url in self.urls), raise_exception=False
):
    # exceptions in "self.crawl(url)" are swallowed and logged automatically
    self.handle(result)
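
The same pattern can be reproduced with the standard library alone, which may help when reasoning about what such a helper has to do. The sketch below is an assumption about the general technique, not ant_nest's implementation; "crawl" and "as_completed_skip_errors" are hypothetical names.

```python
import asyncio
import logging

logger = logging.getLogger("sketch")


async def crawl(url):
    """Hypothetical coroutine that fails for one URL."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return url.upper()


async def as_completed_skip_errors(coros):
    """Async generator: yield results as they finish, log and skip failures."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    for future in asyncio.as_completed(tasks):
        try:
            yield await future
        except Exception as exc:  # one failed task must not end the loop
            logger.warning("task failed: %r", exc)


async def main():
    urls = ["a", "bad", "b"]
    results = []
    async for result in as_completed_skip_errors(crawl(u) for u in urls):
        results.append(result)
    return sorted(results)


results = asyncio.run(main())
print(results)
```

Because the try/except lives inside the generator, the consuming loop stays free of error handling and keeps running after a failure.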

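High memory usage

High memory usage is a "feature" of async code, especially under highly concurrent IO. Setting a concurrency limit ("connection_limit" or "concurrent_limit") is simple, but striking a balance between performance and the limit is complicated.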
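
A concurrency limit of this kind can be sketched with asyncio.Semaphore. This only illustrates the trade-off and is not how ant_nest enforces "concurrent_limit"; the function "fetch" and the "state" counters are hypothetical, and the sleep stands in for network IO.

```python
import asyncio


async def fetch(i, limiter, state):
    """Hypothetical task; at most `concurrent_limit` of these run the body at once."""
    async with limiter:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for a network request
        state["running"] -= 1
    return i


async def main(concurrent_limit=3, total=10):
    limiter = asyncio.Semaphore(concurrent_limit)
    state = {"running": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(i, limiter, state) for i in range(total)))
    return results, state["peak"]


results, peak = asyncio.run(main())
print(results, peak)
```

A lower limit bounds how many requests hold buffers at once but lengthens the total run time. Note that all coroutine objects are still created up front here, so the semaphore bounds in-flight IO rather than the number of pending tasks.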

Coding style

Follows "flake8", formatted by "black", type-checked by "mypy"; see the Makefile for more details.

Todo

[*] Log system
[*] Nested item extractor
[ ] Docs


ant_nest 0.38.1
