hext Python package - PyPI

A module and command-line utility for extracting structured data from HTML

Detailed description of the hext Python project

Hext - Extract Data from HTML

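Hext is a domain-specific language for extracting structured data from HTML. It can be thought of as the counterpart to templates, which web developers typically use to structure content on the web.

A Simple Example

The following Hext snippet collects all hyperlinks and extracts the href and the clickable text: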

<a href:link @text:title />

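Hext works by recursively trying to match every HTML element. In the example above, an element must have the tag a and an attribute named href. If an element matches, its href attribute and its text representation are stored as link and title, respectively.

If the Hext snippet above is applied to this piece of HTML: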

<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>

Hext will produce the following values:

{ "link": "one.html",   "title": "Page 1" },
{ "link": "two.html",   "title": "Page 2" },
{ "link": "three.html", "title": "Page 3" }

You can try this example in Hext's live code editor. For a more detailed explanation, see Hext's documentation and its section "How Hext Matches Elements".
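
The matching described above can be approximated with Python's standard library. The sketch below is not how Hext is implemented; it only mimics this one rule for flat, non-nested anchors, and it collapses whitespace to resemble the trimmed titles shown in the output. The names LinkCollector and collect_links are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that mimics <a href:link @text:title />."""

    def __init__(self):
        super().__init__()
        self.results = []      # one dict per matched <a>
        self._current = None   # dict being filled while inside a matching <a>
        self._text = []        # text fragments seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Match condition: the tag is "a" and an href attribute is present.
        if tag == "a" and "href" in attrs:
            self._current = {"link": attrs["href"]}
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace and strip both ends.
            self._current["title"] = " ".join("".join(self._text).split())
            self.results.append(self._current)
            self._current = None


def collect_links(html_text):
    """Return a list of {"link": ..., "title": ...} dicts."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.results


page = """<body>
  <a href="one.html">  Page 1</a>
  <a href="two.html">  Page 2</a>
  <a href="three.html">Page 3</a>
</body>"""

for row in collect_links(page):
    print(row)
```

An anchor without an href attribute is skipped, which mirrors the requirement that a matching element must carry that attribute.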

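Components

This package includes:

The Hext Python module
The htmlext command-line utility

Using Hext with Python

The module exposes three interfaces:

html = hext.Html("<html>...</html>") -> object
rule = hext.Rule("...") -> object
rule.extract(html) -> list of dictionaries {string -> string}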

import hext
import requests
import json

res = requests.get('https://news.ycombinator.com/')
res.raise_for_status()

# hext.Html's constructor expects a single argument
# containing a UTF-8 encoded string of HTML.
html = hext.Html(res.text)

# hext.Rule's constructor expects a single argument
# containing a Hext snippet.
# Raises ValueError on invalid syntax.
# A raw string keeps the backslash in /\d+/ intact.
rule = hext.Rule(r"""
<tr>
  <td><span @text:rank /></td>
  <td><a href:href @text:title /></td>
</tr>
<?tr>
  <td>
    <span @text:score />
    <a @text:user />
    <a:last-child @text:filter(/\d+/):comment_count />
  </td>
</tr>""")

# hext.Rule.extract expects an argument of type hext.Html.
# Returns a list of dictionaries.
result = rule.extract(html)

# Print each dictionary as JSON
for row in result:
    print(json.dumps(row, ensure_ascii=False,
                     separators=(',', ':')))
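
The rule in the example passes the text of the last link through filter(/\d+/) before storing it as comment_count. As a rough illustration, and assuming the filter keeps the part of the text that matches the regex, the same idea in plain Python looks like the sketch below. The function name first_match is hypothetical, and how Hext itself behaves when nothing matches is not covered here.

```python
import re


def first_match(text, pattern=r"\d+"):
    """Hypothetical stand-in for the filter step: first regex match, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


print(first_match("128 comments"))
print(first_match("discuss"))
```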

Using Hext on the Command Line

Hext ships with a command-line utility called htmlext, which applies Hext snippets to HTML documents and outputs JSON.

htmlext - Extract structured content from HTML.

Usage:
  htmlext [options] <hext-file> <html-file...>
      Apply extraction rules from <hext-file> to each
      <html-file> and print the captured content as JSON.

Options:
  -x [ --hext ] <file>  Add Hext from file
  -i [ --html ] <file>  Add HTML from file
  -s [ --str ] <string> Add Hext from string
  -c [ --compact ]      Print one JSON object per line
  -p [ --pretty ]       Pretty-print JSON
  -a [ --array ]        Wrap results in a JSON array
  -f [ --filter ] <key> Print values whose name matches <key>
  -l [ --lint ]         Do Hext syntax check
  -h [ --help ]         Print this help message
  -V [ --version ]      Print info and version
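
To make the output options more concrete, here is a minimal Python sketch of what compact, pretty, array, and filter output could look like for a list of extracted dictionaries. This is an approximation built on the json module, not htmlext's actual code or its exact formatting. The function render and its mode and key parameters are assumptions made for illustration.

```python
import json


def render(rows, mode="compact", key=None):
    """Hypothetical formatter approximating htmlext's output options."""
    if key is not None:
        # Like --filter: emit only the values stored under `key`.
        return "\n".join(row[key] for row in rows if key in row)
    if mode == "array":
        # Like --array: wrap all results in one JSON array.
        return json.dumps(rows, ensure_ascii=False)
    if mode == "pretty":
        # Like --pretty: indent each object for readability.
        return "\n".join(
            json.dumps(row, ensure_ascii=False, indent=2) for row in rows
        )
    # Like --compact: one JSON object per line, no extra spaces.
    return "\n".join(
        json.dumps(row, ensure_ascii=False, separators=(",", ":"))
        for row in rows
    )


rows = [
    {"link": "one.html", "title": "Page 1"},
    {"link": "two.html", "title": "Page 2"},
]
print(render(rows))
print(render(rows, key="link"))
```

Printing bare values under a key is what makes the output easy to pipe into another program such as xargs.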

Ever wanted to watch submissions on /r/videos in vlc? Well, take a look at this little guy:

htmlext \
  -i <(wget -O- -o/dev/null "https://old.reddit.com/r/videos/") \
  -s '<a class="title" href:x />' \
  -f x \
  | xargs vlc

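License

Hext is released under the terms of the Apache License v2.0. The source code is hosted on GitHub. This binary package includes content authored by third parties: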

hext 0.2.3
