spidy-web-crawler Python package - PyPI

spidy is a simple, easy-to-use command-line web crawler.

Detailed project description for spidy-web-crawler

spidy web crawler is a simple, easy-to-use command-line web crawler. It uses `lxml <http://lxml.de/index.html>`_ to extract all links from a page. Pretty simple!

spidy logo

Version: 1.6.2 · Release: 1.4.0 · License: GPL v3 · Python 3.3+ · All Platforms! · Open Source Love · Lines of Code: 1553 · Lines of Docs: 605

Created by `rivermont <https://github.com/rivermont>`_ and falconwarrior, and developed with help from `these awesome people <https://github.com/rivermont/spidy#contributors>`_.

Looking for technical documentation? Check out `docs.md <https://github.com/rivermont/spidy/blob/master/spidy/docs/docs.md>`_. Looking to contribute to this project? Have a look at `contribution.md <https://github.com/rivermont/spidy/blob/master/spidy/docs/contribution.md>`_, then check out the docs.

----

🎉 New Features!
==================

Multithreading
~~~~~~~~~~~~~~

Crawl all the things! Run separate threads to work on multiple pages at the same time. Such fast. Very wow.
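How spidy implements its threading is not described in this section. As a minimal sketch of the general idea only, the example below uses the standard library's ``ThreadPoolExecutor`` to work on several pages at once. The ``fetch`` function, the ``crawl_concurrently`` helper, and the ``example.com`` URLs are assumptions made for illustration; ``fetch`` merely simulates a download so the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for downloading one page (not spidy's code)."""
    time.sleep(0.05)  # simulate network latency
    return url, f"<html>content of {url}</html>"


def crawl_concurrently(urls, workers=4):
    """Fetch every URL using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs fetch() on several threads but yields results in input order.
        return list(pool.map(fetch, urls))


pages = crawl_concurrently([f"http://example.com/page{i}" for i in range(8)])
print(f"fetched {len(pages)} pages")
```

Because the simulated work is mostly waiting, four workers finish the eight pages in roughly the time two sequential fetches would take; real downloads are similarly I/O-bound, which is why threads help a crawler.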

PyPI
~~~~

Install spidy with one line: ``pip install spidy-web-crawler``!

Automatic Testing with Travis CI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

v1.4.0 - `31663d3 <https://github.com/rivermont/spidy/commit/31663d34ceeba66ea9de9819b6da555492ed6a80>`_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`spidy release 1.4.0 <https://github.com/rivermont/spidy/releases/tag/1.4.0>`_

Contents
========

-  `spidy Web Crawler <https://github.com/rivermont/spidy#spidy-web-crawler>`_
-  `New Features! <https://github.com/rivermont/spidy#new-features>`_
-  `Contents <https://github.com/rivermont/spidy#contents>`_
-  `How it Works <https://github.com/rivermont/spidy#how-it-works>`_
-  `Why It's Different <https://github.com/rivermont/spidy#why-its-different>`_
-  `Features <https://github.com/rivermont/spidy#features>`_
-  `Tutorial <https://github.com/rivermont/spidy#tutorial>`_

   -  `Using with Docker <https://github.com/rivermont/spidy#using-with-docker>`_
   -  `Python Installation <https://github.com/rivermont/spidy#python-installation>`_

      -  `Windows and Mac <https://github.com/rivermont/spidy#windows-and-mac>`_

         -  `Anaconda <https://github.com/rivermont/spidy#anaconda>`_
         -  `Python Base <https://github.com/rivermont/spidy#python-base>`_

      -  `Linux <https://github.com/rivermont/spidy#linux>`_

   -  `Crawler Installation <https://github.com/rivermont/spidy#crawler-installation>`_
   -  Launching
   -  Running

      -  Config
      -  Start
      -  `Autosave <https://github.com/rivermont/spidy#autosave>`_
      -  `Force Quit <https://github.com/rivermont/spidy#force-quit>`_

-  `How Can I Support This? <https://github.com/rivermont/spidy#how-can-i-support-this>`_
-  `Contributors <https://github.com/rivermont/spidy#contributors>`_
-  `License <https://github.com/rivermont/spidy#license>`_

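How it Works
============

spidy has two working lists, ``todo`` and ``done``. ``todo`` is the list of URLs it has not yet visited. ``done`` is the list of URLs it has already been to. The crawler visits each page in ``todo``, scrapes the DOM of the page for links, and adds those links back into ``todo``. It can also save each page, because data hoarding.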
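The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not spidy's actual code: it parses links with the standard library's ``html.parser`` rather than ``lxml`` so that it has no dependencies, and the ``crawl`` function, the ``LinkParser`` class, the ``seen`` set, and the in-memory ``FAKE_WEB`` dictionary are all hypothetical names introduced for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag (stdlib stand-in for lxml)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, max_pages=100):
    """Visit pages from todo, record them in done, and queue new links."""
    todo = deque(start_urls)   # URLs not yet visited
    done = []                  # URLs already visited
    seen = set(start_urls)     # everything ever queued, to avoid repeats
    while todo and len(done) < max_pages:
        url = todo.popleft()
        html = fetch(url)
        done.append(url)
        if html is None:       # page could not be fetched
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return done, list(todo)


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

done, todo = crawl(["http://example.com/"], FAKE_WEB.get)
print(done)
```

Passing ``fetch`` in as a parameter keeps the queue logic separate from the download step, so a real HTTP call could be swapped in without touching the loop.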

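Why It's Different
==================

Most of the other options out there are not web crawlers themselves, simply frameworks and libraries through which one can create and deploy a web spider, for example Scrapy and BeautifulSoup. Scrapy is a web crawling framework, written in Python, specifically created for downloading, cleaning and saving data from the web, whereas BeautifulSoup is a parsing library that allows a programmer to get specific elements out of a webpage. BeautifulSoup alone is not enough, because you have to actually get the webpage in the first place.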

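With spidy, everything runs right out of the box. spidy is a web crawler that is easy to use and runs from the command line. You give it the URL of a web page and it starts crawling away! A very simple and effective way of fetching stuff off of the web.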

Features
========

Here are some features we figure are worth noting.

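-  Error Handling: We have tried to recognize all of the errors spidy runs into and create custom error messages and logging for each. There is a set cap so that after accumulating too many errors the crawler will stop itself.
-  Cross-Platform Compatibility: spidy will work on all three major operating systems: Windows, Mac OS/X, and Linux!
-  Frequent Timestamp Logging: spidy logs almost every action it takes to both the console and one of two log files.
-  Browser Spoofing: Make requests using User Agents from 4 popular web browsers, use a custom spidy bot one, or create your own!
-  Portability: Move spidy's folder and its contents somewhere else and it will run right where it left off. *Note*: This only works if you run it from source code.
-  User-Friendly Logs: Both the console and log file messages are simple and easy to interpret, but packed with information.
-  Webpage Saving: spidy downloads each page that it runs into, regardless of file type. The crawler uses the HTTP ``Content-Type`` header returned with most files to determine the file type.
-  File Zipping: When autosaving, spidy can archive the contents of the ``saved/`` directory to a ``.zip`` file, and then clear ``saved/``.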
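Two of the features above, choosing a file type from the ``Content-Type`` header and zipping then clearing ``saved/``, can be illustrated with the standard library alone. The sketch below is an assumption about how such steps might look, not spidy's implementation: ``extension_for`` and ``zip_and_clear`` are hypothetical helper names, the ``.html`` fallback is an arbitrary choice, and the demo works in a temporary directory.

```python
import mimetypes
import shutil
import tempfile
from pathlib import Path


def extension_for(content_type, default=".html"):
    """Map a Content-Type header value to a file extension."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime) or default


def zip_and_clear(saved_dir, archive_base):
    """Archive saved_dir into <archive_base>.zip, then empty saved_dir.

    archive_base must live outside saved_dir, otherwise the archive
    would try to include itself.
    """
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=str(saved_dir))
    for entry in Path(saved_dir).iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return Path(archive)


with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "saved"
    saved.mkdir()
    # Pretend a PDF was downloaded; the name comes from the header value.
    name = "page" + extension_for("application/pdf; charset=binary")
    (saved / name).write_bytes(b"%PDF-")
    archive = zip_and_clear(saved, Path(tmp) / "saved_backup")
    print(archive.name, "created;", len(list(saved.iterdir())), "files left")
```

Archiving before deleting means a failure while writing the ``.zip`` leaves the saved pages untouched.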

Tutorial
========

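Using with Docker
~~~~~~~~~~~~~~~~~

-  First, build the Docker image: ``docker build -t spidy .``

   -  Verify that the Docker image has been created: ``docker images``

-  Then, run it: ``docker run --rm -it -v $PWD:/data spidy``

   -  ``--rm`` tells Docker to clean up after itself by removing stopped containers.
   -  ``-it`` tells Docker to run the container interactively and allocate a pseudo-TTY.
   -  ``-v $PWD:/data`` tells Docker to mount the current working directory as the ``/data`` directory inside the container. This is needed if you want spidy's files (e.g. ``crawler_done.txt``, ``crawler_words.txt``, ``crawler_todo.txt``) written back to your host filesystem.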

Spidy Docker Demo
~~~~~~~~~~~~~~~~~

.. figure:: media/spidy_docker_demo.gif
   :alt: spidy docker demo

After installing from PyPI, simply run the ``spidy`` command. The working files will be found in your home directory.

Python Installation
~~~~~~~~~~~~~~~~~~~

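Windows and Mac
^^^^^^^^^^^^^^^

There are many different versions of `Python <https://www.python.org/about/>`_, and hundreds of different installations for each of them. spidy is developed for Python v3.5.2, but should run without errors in other versions of Python 3.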

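Anaconda
''''''''

We recommend the `Anaconda distribution <https://www.continuum.io/downloads>`_. It comes pre-packaged with lots of goodies, including ``lxml``, which spidy requires to run and which is not included in the standard Python package.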

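Python Base
'''''''''''

You can also install `default Python <https://www.python.org/downloads/>`_ and install the external libraries separately. This can be done with ``pip``; for example, ``pip install lxml requests`` would cover the two libraries that the ``apt`` command in the Linux instructions installs.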
Linux
^^^^^

If Python 3 is not already installed, simply run

::

    sudo apt update
    sudo apt install python3 python3-lxml python3-requests

Then ``cd`` into the crawler's directory and run ``python3 crawler.py``.

Crawler Installation
~~~~~~~~~~~~~~~~~~~~

If you have git installed, you can clone the repository `from here <https://github.com/rivermont/spidy.git>`_. If not, download `the latest source code <https://github.com/rivermont/spidy/archive/master.zip>`_ or grab the `latest release <https://github.com/rivermont/spidy/releases>`_.

Launching
~~~~~~~~~

.. figure:: https://raw.githubusercontent.com/rivermont/spidy/master/media/run.gif
   :alt:

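Running
~~~~~~~

spidy logs a lot of information to the command line throughout its life. Once started, a bunch of ``[init]`` lines will print. These announce where spidy is in its initialization process.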
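For readers curious how timestamped lines can reach both the console and a log file, here is a minimal sketch using Python's standard ``logging`` module. It is not spidy's logging code: the ``make_logger`` helper, the logger name, the log file name, and the two ``[init]`` messages are placeholders invented for the example, and the real messages and timestamp format may differ.

```python
import logging
import sys
import tempfile
from pathlib import Path


def make_logger(log_path, name="spidy_sketch"):
    """Return a logger that writes timestamped lines to stdout and a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    formatter = logging.Formatter("[%(asctime)s] %(message)s", datefmt="%H:%M:%S")
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Placeholder file name and messages, for illustration only.
log_file = Path(tempfile.mkdtemp()) / "example_log.txt"
log = make_logger(log_file)
log.info("[init] placeholder step 1")
log.info("[init] placeholder step 2")
```

Attaching two handlers to one logger is what lets a single ``log.info`` call produce the same line in both places.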

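Config
^^^^^^

You can also use one of the configuration files, or even create your own.

To use spidy with a configuration file, enter the name of the file when the crawler asks for it.

The config files included with spidy are:

-  ``blank.txt``: Template for creating your own configurations.
-  ``default.cfg``: The default version.
-  ``heavy.cfg``: Run spidy with all of its features enabled.
-  ``infinite.cfg``: The default config, but it never stops itself.
-  ``light.cfg``: Disable most features; only crawls pages for links.
-  ``rivermont.cfg``: My personal favorite settings.
-  ``rivermont-infinite.cfg``: My favorite, never-ending configuration.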

Start
^^^^^

Sample start log.

.. figure:: https://raw.githubusercontent.com/rivermont/spidy/master/media/start.png
   :alt:

Autosave
^^^^^^^^

Sample log after hitting the autosave cap.

.. figure:: https://raw.githubusercontent.com/rivermont/spidy/master/media/log.png
   :alt:

Force Quit
^^^^^^^^^^

.. figure:: https://raw.githubusercontent.com/rivermont/spidy/master/media/keyboardinterrupt.png
   :alt:

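How Can I Support This?
=======================

The easiest thing you can do is star spidy if you think it's cool, or watch it if you would like to get updates. If you have a suggestion, `create an issue <https://github.com/rivermont/spidy/issues/new>`_ or fork the ``master`` branch and open a pull request.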

Contributors
============

See `contribution.md <https://github.com/rivermont/spidy/blob/master/spidy/docs/contribution.md>`_.

-  The logo was designed by `cutwell <https://github.com/cutwell>`_.

-  `3onyc <https://github.com/3onyc>`_ - PEP8 compliance.
-  `dekan <https://github.com/dekan>`_ - Getting PyPI packaging to work.
-  `esouthren <https://github.com/esouthren>`_ - Unit testing.
-  `hriliy <https://github.com/hriliy>`_ - Multithreading.
-  `j-setiawan <https://github.com/j-setiawan>`_ - Paths that work on all operating systems.
-  `michellemorales <https://github.com/michellemorales>`_ - Confirmed OS/X support.
-  `peterbenjamin <https://github.com/peterbenjamin>`_ - Docker support.
-  `quatroka <https://github.com/quatroka>`_ - Fixed testing bugs.
-  `stevelle <https://github.com/stevelle>`_ - Respect robots.txt.
-  `thatguywiththatname <https://github.com/thatguywiththatname>`_ - README link corrections.

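License
=======

We used the `GNU General Public License <https://www.gnu.org/licenses/gpl-3.0.en.html>`_ (see `license <https://github.com/rivermont/spidy/blob/master/license>`_) as it was the license that best suited our needs. Honestly, if you link to this repo and credit ``rivermont`` and ``falconwarrior``, and you aren't selling spidy in any way, then we would love for you to distribute it. Thanks!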

----

.. |spidy logo| image:: https://raw.githubusercontent.com/rivermont/spidy/master/media/spidy_logo.png
   :target: https://github.com/rivermont/spidy#contributors
.. |Version: 1.6.2| image:: https://img.shields.io/badge/version-1.6.2-brightgreen.svg
.. |Release: 1.4.0| image:: https://img.shields.io/github/release/rivermont/spidy.svg
   :target: https://github.com/rivermont/spidy/releases
.. |License: GPL v3| image:: https://img.shields.io/badge/license-gplv3.0-blue.svg
   :target: http://www.gnu.org/licenses/gpl-3.0
.. |Python 3.3+| image:: https://img.shields.io/badge/python-3.3+-brightgreen.svg
   :target: https://docs.python.org/3/
.. |All Platforms!| image:: https://img.shields.io/badge/windows,%20os/x,%20linux-%20%20-brightgreen.svg
.. |Open Source Love| image:: https://badges.frapsoft.com/os/v1/open-source.png?v=103
.. |Lines of Code: 1553| image:: https://img.shields.io/badge/lines%20of%20code-1553-brightgreen.svg
.. |Lines of Docs: 605| image:: https://img.shields.io/badge/lines%20of%20docs-605-orange.svg
.. |Last Commit| image:: https://img.shields.io/github/last-commit/rivermont/spidy.svg
   :target: https://github.com/rivermont/spidy/graphs/punch-card
.. |Travis CI Status| image:: https://img.shields.io/travis/rivermont/spidy/master.svg
   :target: https://travis-ci.org/rivermont/spidy
.. |PyPI Wheel| image:: https://img.shields.io/pypi/wheel/spidy-web-crawler.svg
   :target: https://pypi.org/project/spidy-web-crawler/
.. |PyPI Status| image:: https://img.shields.io/pypi/status/spidy-web-crawler.svg
   :target: https://pypi.org/project/spidy-web-crawler/
.. _contributors graph: https://github.com/rivermont/spidy/graphs/contributors
.. |Forks| image:: https://img.shields.io/github/forks/rivermont/spidy.svg?style=social&label=forks
   :target: https://github.com/rivermont/spidy/network
.. |Stars| image:: https://img.shields.io/github/stars/rivermont/spidy.svg?style=social&label=stars
   :target: https://github.com/rivermont/spidy/stargazers


spidy-web-crawler 1.6.5
