htmlement Python package - PyPI

A pure-Python HTML parser with ElementTree support.

Project description for htmlement

HTMLement

HTMLement is a pure-Python HTML parser.

The goal of this project is to be a pure-Python HTML parser that is faster than Beautiful Soup. Like Beautiful Soup, it also parses invalid HTML.

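The simplest way to work with the parsed result is to use ElementTree XPath expressions. Python's ElementTree module includes a simple (read: limited) XPath engine. A further benefit of ElementTree is that it uses the C implementation whenever one is available.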
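
To make "limited" concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a small, well-formed snippet. The markup and variable names are invented for this illustration. Descendant searches and simple attribute predicates work, while XPath functions such as contains() are rejected.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed markup parsed with the standard library only.
doc = ET.fromstring(
    "<body>"
    "<a href='https://example.com/a' class='ext'>A</a>"
    "<div><a href='https://example.com/b'>B</a></div>"
    "</body>"
)

# Supported: descendant search and simple attribute predicates.
all_links = [a.get("href") for a in doc.findall(".//a")]
ext_links = [a.text for a in doc.findall(".//a[@class='ext']")]

# Not supported: XPath functions such as contains().
try:
    doc.findall(".//a[contains(@href, 'b')]")
    unsupported = False
except SyntaxError:
    unsupported = True

print(all_links)
print(ext_links)
print(unsupported)
```

When a query needs more than this subset, filtering the matched elements in plain Python is usually the simplest workaround.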

This HTML parser extends html.parser.HTMLParser to build a tree of ElementTree.Element instances.
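
As a rough illustration of that idea (not htmlement's actual implementation), the sketch below subclasses html.parser.HTMLParser and forwards its events to xml.etree.ElementTree.TreeBuilder. The class name SketchTreeParser, the VOID_TAGS set, and the recovery rules for unclosed tags are assumptions made for this example, and it assumes the input has a single root element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Void elements never get a closing tag in HTML, so close them at once.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SketchTreeParser(HTMLParser):
    """Hypothetical, simplified parser; not htmlement's actual code."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags that match nothing currently open.
        if tag not in self._open:
            return
        # Close unclosed children first, then the requested tag.
        while self._open:
            top = self._open.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        # Text outside the root element is dropped.
        if self._open:
            self._builder.data(data)

    def close(self):
        super().close()
        # Close anything the document left open, then return the root.
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()


parser = SketchTreeParser()
parser.feed("<html><body><a href='/x'>link<br>tail</body></html>")
root = parser.close()
anchor = root.find("body/a")
print(anchor.get("href"), anchor.text)
```

The unclosed a element is closed when its parent body ends, which is one simple way to tolerate invalid markup. A real parser needs more careful recovery rules than this.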

Install

Run

pip install htmlement

-or-

pip install git+https://github.com/willforde/python-htmlement.git

Parsing HTML

Here I'll use a sample HTML document and parse it with htmlement:

html = """
<html>
  <head>
    <title>GitHub</title>
  </head>
  <body>
    <a href="https://github.com/marmelo">GitHub</a>
    <a href="https://github.com/marmelo/python-htmlparser">GitHub Project</a>
  </body>
</html>
"""

# Parse the document
import htmlement
root = htmlement.fromstring(html)

root is an ElementTree.Element and supports the ElementTree API, including XPath expressions. With it I can easily get the title and every anchor in the document.

# Get title
title = root.find("head/title").text
print("Parsing: %s" % title)

# Get all anchors
for a in root.iterfind(".//a"):
    print(a.get("href"))

The output looks like this:

Parsing: GitHub
https://github.com/marmelo
https://github.com/marmelo/python-htmlparser

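Parsing HTML with a filter

Here I'll use a slightly more complex HTML document and parse it with htmlement using a filter, so that only the menu items are fetched. This is very useful when dealing with large HTML documents, because only the required section is parsed and everything else is ignored.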

html = """
<html>
  <head>
    <title>Coffee shop</title>
  </head>
  <body>
    <ul class="menu">
      <li>Coffee</li>
      <li>Tea</li>
      <li>Milk</li>
    </ul>
    <ul class="extras">
      <li>Sugar</li>
      <li>Cream</li>
    </ul>
  </body>
</html>
"""

# Parse the document
import htmlement
root = htmlement.fromstring(html, "ul", attrs={"class": "menu"})

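In this case I cannot get the title, because every element outside the filter was ignored. In exchange, I can extract all list item (li) elements within the menu list and nothing else.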

# Get all listitems
for item in root.iterfind(".//li"):
    # Get text from listitem
    print(item.text)

The output looks like this:

Coffee
Tea
Milk
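
To show how such a filter could work in principle, here is a minimal sketch built on the standard library alone. It is not htmlement's implementation: the class name FilteredSketchParser and its depth-counting logic are assumptions for illustration, and it assumes the wanted section exists and is well-formed, with an explicit end tag for every element inside it.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class FilteredSketchParser(HTMLParser):
    """Hypothetical filter sketch; htmlement's real logic may differ."""

    def __init__(self, tag, attrs):
        super().__init__()
        self._want_tag = tag
        self._want_attrs = attrs
        self._builder = ET.TreeBuilder()
        self._depth = 0      # 0 means "outside the wanted section"
        self._done = False   # True once the wanted section has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        attrs = {k: (v or "") for k, v in attrs}
        if self._depth == 0:
            matches = tag == self._want_tag and all(
                attrs.get(k) == v for k, v in self._want_attrs.items()
            )
            if not matches:
                return  # still outside the section: skip this element
        self._builder.start(tag, attrs)
        self._depth += 1

    def handle_endtag(self, tag):
        if self._done or self._depth == 0:
            return
        self._builder.end(tag)
        self._depth -= 1
        if self._depth == 0:
            self._done = True  # ignore everything after the section

    def handle_data(self, data):
        if self._depth > 0:
            self._builder.data(data)

    def close(self):
        super().close()
        return self._builder.close()


sample = """
<html>
  <head><title>Coffee shop</title></head>
  <body>
    <ul class="menu"><li>Coffee</li><li>Tea</li><li>Milk</li></ul>
    <ul class="extras"><li>Sugar</li><li>Cream</li></ul>
  </body>
</html>
"""

parser = FilteredSketchParser("ul", {"class": "menu"})
parser.feed(sample)
menu = parser.close()
items = [li.text for li in menu.iterfind(".//li")]
print(menu.tag, items)
```

Because nothing is built before the first matching start tag or after its end tag, the resulting tree holds only the requested section, which is where the savings on large documents would come from.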

htmlement 1.0.0
