scrapepath Python package - PyPI

A templated scraping syntax

Detailed description of the scrapepath Python project

Scrapepath

Scrapepath is a templated web scraping syntax. It can be installed with pip: pip install scrapepath.

Requirements

Install the required Python dependencies from the provided requirements.txt file with:

pip install -r requirements.txt

Usage

To run the example, execute the following on the command line without arguments:

./parser

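To use it from Python:

import json
from parser import NodeParser

np = NodeParser(soup_template, soup, live_url)
np.hop_template()
# The scraped data is collected in np.result_dict
print(json.dumps(np.result_dict, indent=2, default=str))

Here soup_template is the BeautifulSoup of the template file, soup is the BeautifulSoup of the page being scraped, and live_url is the URL of the page being scraped.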

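Templates

HTML pages are scraped using HTML templates, which consist of only the most important tags and statements.

A template is an HTML file containing nested tags that lead to the elements of interest in the scraped page.

The parser is based on BeautifulSoup.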

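Example 1: scraping data

The example below uses the template examples/example1a.html and the scraped page examples/scraped1.html. Run the example with:

./parser.py examples/example1a.html examples/scraped1.html

This scrapes the target page scraped1.html using the template example1a.html. The text item "Tea" is scraped from the target page by means of the record attribute in the template. The path to the target text ("Tea") is specified in the template with tags that correspond to those in the target page. So, to scrape from: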

<ul class="my_list">
  <li class="my_item">Coffee</li>
  <li class="my_item"><span class="cuppa">Tea</span></li>
  <li class="my_item">Milk</li>
</ul>

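use the template:

<ul class="my_list">
  <span class="cuppa" record="text as favorite"></span>
</ul>

This produces a dictionary containing the scraped data under the key "favorite", as specified in the record attribute:

{"favorite": "Tea"}

In the record attribute, the text statement corresponds to a function that gets the text from an HTML tag, and favorite is the key under which the data is recorded. The text function can be replaced with a custom Python function.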
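The record statement described above can be pictured as a small "FUNCTION as KEY" dispatch. The following is a minimal sketch of that idea only: the names RECORD_FUNCTIONS, get_text, shout and apply_record are hypothetical and are not part of scrapepath, whose actual mechanism for registering custom functions is not described here.

```python
# Hypothetical sketch: none of these names come from scrapepath itself.

def get_text(tag_text):
    """Stand-in for the default 'text' function: return the stripped text."""
    return tag_text.strip()


def shout(tag_text):
    """Example of a custom function that could replace 'text'."""
    return tag_text.strip().upper()


# Maps the function name used in a record statement to a callable.
RECORD_FUNCTIONS = {"text": get_text, "shout": shout}


def apply_record(statement, tag_text):
    """Split 'FUNCTION as KEY' and return {KEY: FUNCTION(tag_text)}."""
    func_name, _, key = statement.partition(" as ")
    func = RECORD_FUNCTIONS[func_name.strip()]
    return {key.strip(): func(tag_text)}


print(apply_record("text as favorite", " Tea "))   # {'favorite': 'Tea'}
print(apply_record("shout as favorite", " Tea "))  # {'favorite': 'TEA'}
```

Keeping the functions in a dictionary means a new statement name only needs one extra entry, which is one plausible way to support custom functions.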

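Starting from the outer <ul> node in the template, the parser looks in the scraped page for the first node that matches the template node in type and attributes. In this case, ul is matched with ul, and class my_list with class my_list. The same search is then carried out with the children of the template node, now restricted to the children of the matched node in the scraped page. Nested template nodes therefore represent a path. The <li> node is left out of the template because it would direct the search to the first element of the list.

In this case, the nested template nodes are unnecessarily specific. No other node has the class "cuppa", so the <ul> and <li> items can be omitted, and the template below records the same data:

<span class="cuppa" record="text as favorite"></span>

A path along many nested nodes in the scraped page can therefore be summarised by just a few template nodes, as long as they define a unique path to the scraped data.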
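The matching rule above can be sketched with the standard library alone. This is an illustration of the idea and not scrapepath's implementation: it uses xml.etree.ElementTree instead of BeautifulSoup, searches all nodes beneath a match (which is what allows the <li> level to be skipped), and always records plain text regardless of the function named in the record statement. The names matches, first_match, scrape and run are made up for this sketch.

```python
import xml.etree.ElementTree as ET

PAGE = """
<ul class="my_list">
  <li class="my_item">Coffee</li>
  <li class="my_item"><span class="cuppa">Tea</span></li>
  <li class="my_item">Milk</li>
</ul>
"""

NESTED_TEMPLATE = (
    '<ul class="my_list">'
    '<span class="cuppa" record="text as favorite"></span>'
    "</ul>"
)
SHORT_TEMPLATE = '<span class="cuppa" record="text as favorite"></span>'


def matches(template_node, page_node):
    """True if the tag and every attribute except 'record' agree."""
    if template_node.tag != page_node.tag:
        return False
    return all(
        page_node.get(name) == value
        for name, value in template_node.attrib.items()
        if name != "record"
    )


def first_match(template_node, scope):
    """Return the first node beneath scope that matches, or None."""
    for candidate in scope.iter():
        if candidate is not scope and matches(template_node, candidate):
            return candidate
    return None


def scrape(template_node, scope, result):
    """Match one template node, record its text, then descend."""
    found = first_match(template_node, scope)
    if found is None:
        return result
    record = template_node.get("record")
    if record:
        _func, _, key = record.partition(" as ")
        result[key.strip()] = "".join(found.itertext()).strip()
    for child in template_node:
        scrape(child, found, result)  # search is now limited to 'found'
    return result


def run(template_html):
    # Wrap the page so the outermost page node can itself be matched.
    page = ET.fromstring("<root>" + PAGE + "</root>")
    return scrape(ET.fromstring(template_html), page, {})


print(run(NESTED_TEMPLATE))  # {'favorite': 'Tea'}
print(run(SHORT_TEMPLATE))   # {'favorite': 'Tea'}
```

Both templates give the same result because the span with class "cuppa" is unique in the page.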

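Loops:

A loop scrapes all the items in a list. In this simple example, only one variable (item_text) is recorded per item:

Template:

<ul class="my_list">
  <for items="items" condition="i < 5">
    <li class="my_item" record="text as item_text"></li>
  </for>
</ul>

This results in the output:

{"items": [{"item_text": "Coffee"}, {"item_text": "Tea"}, {"item_text": "Milk"}, {"item_text": "Biscuits"}, {"item_text": "Chocolate"}]}

Here, the parser matches all children of the <for> template node against the children of the <ul> node in the scraped page scraped1.html. Run the example with ./parser.py examples/example1b.html examples/scraped1.html. The condition attribute specifies that only the first 5 items should be recorded, where i is the loop counter variable.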
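A loop with a condition can be sketched as follows. This is a simplified illustration under several assumptions, not the library's code: the counter i is assumed to start at 0 and to count children of the list, the page content (including a sixth item, added to show the cut-off) is assumed, and the condition is limited to the form "i OP INTEGER". Because xml.etree.ElementTree is stricter than BeautifulSoup, the "<" in the condition is written as &lt; here. The names check_condition and scrape_loop are hypothetical.

```python
import operator
import xml.etree.ElementTree as ET

# Assumed page content; "Sugar" is added only to demonstrate the condition.
PAGE = """<ul class="my_list">
  <li class="my_item">Coffee</li>
  <li class="my_item"><span class="cuppa">Tea</span></li>
  <li class="my_item">Milk</li>
  <li class="my_item">Biscuits</li>
  <li class="my_item">Chocolate</li>
  <li class="my_item">Sugar</li>
</ul>"""

TEMPLATE = """<ul class="my_list">
  <for items="items" condition="i &lt; 5">
    <li class="my_item" record="text as item_text"></li>
  </for>
</ul>"""

OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def check_condition(condition, i):
    """Evaluate a condition of the form 'i OP INTEGER' without eval."""
    name, op, number = condition.split()
    if name != "i":
        raise ValueError("this sketch only supports the loop counter i")
    return OPERATORS[op](i, int(number))


def scrape_loop(template_html, page_html):
    template = ET.fromstring(template_html)
    page = ET.fromstring(page_html)  # assumed to match the outer template node
    loop = template.find("for")
    condition = loop.get("condition")
    items = []
    for i, child in enumerate(page):
        if condition and not check_condition(condition, i):
            continue
        # Probe the page child with every template node inside the loop.
        for probe in loop:
            if probe.tag == child.tag and probe.get("class") == child.get("class"):
                _func, _, key = probe.get("record").partition(" as ")
                items.append({key.strip(): "".join(child.itertext()).strip()})
                break
    return {loop.get("items"): items}


result = scrape_loop(TEMPLATE, PAGE)
print(result)  # five items, from Coffee to Chocolate
```

Parsing the condition with a small operator table avoids calling eval on text taken from a template file.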

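Example 2: for loops over mixed nodes

In the HTML below, the <for> template loop node needs to contain two template nodes, one for each combination of tag (div and p) and class (my_item and milk_class):

To scrape:

<div class="my_list">
  <div class="my_item">Coffee</div>
  <div class="my_item"><span class="cuppa">Tea</span></div>
  <p class="milk_class">Milk</p>
  <div class="my_item">Biscuits</div>
  Chocolate
</div>

use the template:

<div class="my_list">
  <for items="items">
    <div class="my_item" record="text as item_text"></div>
    <p class="milk_class" record="text as item_text"></p>
  </for>
</div>

However, the <for> template loop node cannot record the text element "Chocolate", because <for> only looks for suitable nodes among the children of the <div class="my_list"> node. Recording it requires a <forchild> template loop node, together with a <str> template node that records the NavigableString element "Chocolate":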

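Template:

<div class="my_list">
  <forchild items="items_with_string">
    <div class="my_item" record="text as item_text"></div>
    <p class="milk_class" record="text as item_text"></p>
    <str record="text as item_text"></str>
  </forchild>
</div>

In this case, the parser looks for the first match for the first template node (that is, the child of the <for> node) and loops over its children, probing each of them with all the template nodes (that is, the children of the loop node). Run this example using examples/example1b.html and examples/scraped1.html.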
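The difference between looping over tags only and looping over every child, bare strings included, can be illustrated with the standard library's xml.dom.minidom, where a text node plays roughly the role of BeautifulSoup's NavigableString. This is a sketch under that assumption and not how scrapepath is implemented; ELEMENT_PROBES, text_of and scrape_children are hypothetical names.

```python
from xml.dom import minidom

PAGE = (
    '<div class="my_list">'
    '<div class="my_item">Coffee</div>'
    '<div class="my_item"><span class="cuppa">Tea</span></div>'
    '<p class="milk_class">Milk</p>'
    '<div class="my_item">Biscuits</div>\n'
    "  Chocolate\n"
    "</div>"
)

# (tag, class) pairs standing in for the element template nodes in the loop.
ELEMENT_PROBES = [("div", "my_item"), ("p", "milk_class")]


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def scrape_children(page_html, record_strings):
    """Loop over the children of the outer node.

    With record_strings=False only element children are recorded; with
    record_strings=True bare strings are recorded too, as a <str> probe would.
    """
    parent = minidom.parseString(page_html).documentElement
    items = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if (child.tagName, child.getAttribute("class")) in ELEMENT_PROBES:
                items.append({"item_text": text_of(child).strip()})
        elif child.nodeType == child.TEXT_NODE and record_strings:
            text = child.data.strip()
            if text:  # ignore whitespace-only strings between tags
                items.append({"item_text": text})
    return items


print(scrape_children(PAGE, record_strings=False))  # stops at Biscuits
print(scrape_children(PAGE, record_strings=True))   # also records Chocolate
```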

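Example 3: jumping to a linked page

Use the <jump> template node to follow links on a page:

To scrape:

<a href="example_linked.html"></a>

use the template:

<a record="href as my_link">
  <jump on="my_link">
    <ibody>
      <div class="message" record="text as msg_from_link"></div>
    </ibody>
  </jump>
</a>

Here, the nodes inside the <jump> node act on the linked page.

Run this example with: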

./parser.py examples/example3a.html examples/scraped3.html
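Following a recorded link usually means resolving a relative href against the URL of the page it was found on. How scrapepath does this internally is not stated here, so the sketch below only shows one plausible approach using urllib.parse.urljoin from the standard library; resolve_link and the example.com URL are hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(live_url, href):
    """Resolve a scraped href against the URL of the page it came from."""
    return urljoin(live_url, href)


# Hypothetical URL of the scraped page, and the value recorded as my_link.
live_url = "https://example.com/examples/scraped3.html"
recorded = {"my_link": "example_linked.html"}

target = resolve_link(live_url, recorded["my_link"])
print(target)  # https://example.com/examples/example_linked.html
```

The resolved URL is what would then be fetched and parsed before the nodes inside <jump> are applied to it.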

scrapepath 0.1.1
