html_librarian Python package - PyPI

Never scrape the HTML from a site more than once.

Detailed description of the html_librarian Python project

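# Librarian

The goal of this package is to act almost like a set of training wheels for web scraping.

A good example is recursively trying to visit every link on a site, such as: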

http://web.archive.org/web/20080827084856/http://www.nanowerk.com:80/nanotechnology/nanommaterial/commercial_all.php?page=2
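The "visit every link" task can be sketched with the standard library alone. This is a minimal illustration, not part of html_librarian: `LinkCollector`, `crawl`, and the in-memory `FAKE_SITE` are hypothetical names, the traversal is an iterative breadth-first version of the recursive visit, and `fetch` is whatever callable you use to turn a URL into HTML.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Visit pages reachable from start_url, fetching each URL at most once.

    `fetch` is any callable that takes a URL and returns its HTML as a string.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A tiny in-memory "site" stands in for real network requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": '<a href="/a">A</a>',
}
pages = crawl("http://example.test/", FAKE_SITE.__getitem__)
```

Within one run the `seen` set keeps each page from being requested twice. Across runs nothing is remembered, and that is the gap a saved-HTML store is meant to fill.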

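The goal of the Librarian is to save HTML for later use, so that you do not have to repeat work you have already done, you are kinder to the sites you request from, you save time, and your scraping goes more smoothly.

Let's outline an example:

Look at the site above with Inspect Element; you will see that all of the names and links are under `<div class="divhead">`, and all of the blurbs are under `<div class="divline">`. Now, I would probably do this: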

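``` python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

alink = 'http://web.archive.org/web/20080827084856/http://www.nanowerk.com:80/nanotechnology/nanommaterial/commercial_all.php?page=2'
# Request the page over the network
resp = urlopen(alink)
```

[...] or there is some other reason. For the first of these reasons, I would probably run the same check on `div.divline`. Then, for those sites, I would have to visit each one recursively and fetch its HTML.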

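Alternatively, we can do this:

``` python3
from bs4 import BeautifulSoup
from librarian import Librarian

# Assumption: the source never shows how `lib` is created;
# a no-argument constructor is assumed here.
lib = Librarian()
html = lib.get(alink)  # alink is the URL defined in the previous example
soup = BeautifulSoup(html, 'lxml')
```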

The Librarian finds the HTML and pulls it from your `htmlibrary`, so you can use it immediately.

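If you need fresh HTML, just do this:

``` python3
from librarian import Librarian

# `lib` is the same object used in the example above
removed = lib.remove(alink)
assert removed
```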

`lib.remove(alink)` removes the link `alink` from both the `htmlibrary` and the `linkmap`, so the next time you call `lib.get(alink)` with the same link, the Librarian fetches the HTML again.
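The get/remove behaviour described above can be modelled in a few lines. The sketch below is a hypothetical stand-in, not the html_librarian implementation: the class name `TinyLibrarian`, the injected `fetch` callable, the hashed file names, and the in-memory `linkmap` dictionary are all assumptions made for illustration. Only the names `htmlibrary`, `linkmap`, `get`, and `remove` come from the text.

```python
import hashlib
import tempfile
from pathlib import Path


class TinyLibrarian:
    """Hypothetical model of a save-once HTML store (not the real package)."""

    def __init__(self, fetch, library_dir="htmlibrary"):
        self.fetch = fetch                # callable: url -> HTML string
        self.library = Path(library_dir)  # directory holding the saved HTML
        self.library.mkdir(parents=True, exist_ok=True)
        self.linkmap = {}                 # url -> path of the saved file

    def get(self, url):
        path = self.linkmap.get(url)
        if path is not None and path.exists():
            # Already saved: read it back without touching the network.
            return path.read_text(encoding="utf-8")
        # Not saved yet: fetch once, store the HTML, and record the link.
        html = self.fetch(url)
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        path = self.library / name
        path.write_text(html, encoding="utf-8")
        self.linkmap[url] = path
        return html

    def remove(self, url):
        # Forget the link and delete its file; report whether it was known.
        path = self.linkmap.pop(url, None)
        if path is None:
            return False
        path.unlink(missing_ok=True)  # requires Python 3.8+
        return True


# Demonstration with a fake fetcher that counts how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>version %d</html>" % len(calls)


with tempfile.TemporaryDirectory() as tmp:
    lib = TinyLibrarian(fake_fetch, library_dir=tmp)
    first = lib.get("http://example.test/")   # fetched and saved
    second = lib.get("http://example.test/")  # served from the saved file
    removed = lib.remove("http://example.test/")
    third = lib.get("http://example.test/")   # fetched again after removal
```

In this sketch the `linkmap` lives only in memory, so a real tool would also need to persist it between runs. How html_librarian does that is not stated in the source.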

This project is still in its infancy, so if there is a feature you would like to see, please open an issue and I will work on it.

html_librarian 0.0.1
