lxml: clean_html replaces the html tag with div?
I'm using lxml 3.1.0 (installed via easy_install) and I'm getting a strange result:
>>> from lxml.html.clean import clean_html
>>> clean_html("<html><body><h1>hi</h1></body></html>")
'<div><body><h1>hi</h1></body></div>'
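The html tag has been replaced with a div tag.
The same thing happens with the sample HTML at http://lxml.de/lxmlhtml.html#cleaning-up-html
What is going on here? Am I hitting a bug in lxml, is this an incompatibility with my libxml2 version, or is this expected behavior?
2 Answers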
3
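If page_structure=True (the default), the structural parts of the page, such as <head>, <html> and <title>, are removed. Because the root element cannot simply be dropped, the root <html> is rewritten as a <div>, which is what you are seeing. To change this, pass page_structure=False: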
import lxml.html.clean as clean
content = '<html><body><h1>hi</h1></body></html>'
cleaner = clean.Cleaner(page_structure=False)
cleaned = cleaner.clean_html(content)
print(cleaned)
# <html><body><h1>hi</h1></body></html>
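To see the two behaviors side by side, here is a minimal sketch that runs the same input through the default cleaner and through one built with page_structure=False. It assumes lxml.html.clean is importable in your environment (in recent lxml releases the cleaner is distributed separately as the lxml_html_clean package); the variable names are just for illustration.

```python
from lxml.html.clean import Cleaner, clean_html

content = "<html><body><h1>hi</h1></body></html>"

# Module-level clean_html uses a default Cleaner, where page_structure=True,
# so the root <html> element is expected to come back rewritten as <div>.
default_result = clean_html(content)

# A Cleaner that is told to leave the structural tags alone.
keep_structure = Cleaner(page_structure=False)
kept_result = keep_structure.clean_html(content)

print(default_result)
print(kept_result)
```

The first line printed should start with <div and the second with <html, which matches the output shown in the question and in the snippet above.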
See the docstring of the clean.Cleaner class:
In [105]: clean.Cleaner?
Type: type
String Form:<class 'lxml.html.clean.Cleaner'>
File: /usr/lib/python2.7/dist-packages/lxml/html/clean.py
Definition: clean.Cleaner(self, doc)
Docstring:
Instances cleans the document of each of the possible offending
elements. The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.
``scripts``:
Removes any ``<script>`` tags.
``javascript``:
Removes any Javascript, like an ``onclick`` attribute.
``comments``:
Removes any comments.
``style``:
Removes any style tags or attributes.
``links``:
Removes any ``<link>`` tags
``meta``:
Removes any ``<meta>`` tags
``page_structure``:
Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
``processing_instructions``:
Removes any processing instructions.
``embedded``:
Removes any embedded objects (flash, iframes)
``frames``:
Removes any frame-related tags
``forms``:
Removes any form tags
``annoying_tags``:
Tags that aren't *wrong*, but are annoying. ``<blink>`` and ``<marquee>``
``remove_tags``:
A list of tags to remove.
``allow_tags``:
A list of tags to include (default include all).
``remove_unknown_tags``:
Remove any tags that aren't standard parts of HTML.
``safe_attrs_only``:
If true, only include 'safe' attributes (specifically the list
from `feedparser
<http://feedparser.org/docs/html-sanitization.html>`_).
``add_nofollow``:
If true, then any <a> tags will have ``rel="nofollow"`` added to them.
``host_whitelist``:
A list or set of hosts that you can use for embedded content
(for content like ``<object>``, ``<link rel="stylesheet">``, etc).
You can also implement/override the method
``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
implement more complex rules for what can be embedded.
Anything that passes this test will be shown, regardless of
the value of (for instance) ``embedded``.
Note that this parameter might not work as intended if you do not
make the links absolute before doing the cleaning.
``whitelist_tags``:
A set of tags that can be included with ``host_whitelist``.
The default is ``iframe`` and ``embed``; you may wish to
include other tags like ``script``, or you may want to
implement ``allow_embedded_url`` for more control. Set to None to
include all tags.
This modifies the document *in place*.
Constructor information:
Definition:clean.Cleaner(self, **kw)
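The docstring says the attributes can also be overridden in a subclass instead of being passed to the constructor. A rough sketch of that approach follows; the class name StructureKeepingCleaner and the sample markup are made up for this example, and it again assumes lxml.html.clean is importable.

```python
from lxml.html.clean import Cleaner


class StructureKeepingCleaner(Cleaner):
    """Hypothetical subclass: same defaults as Cleaner, but keeps <html>/<head>/<title>."""

    # Override the class-level attribute instead of passing a keyword argument.
    page_structure = False


cleaner = StructureKeepingCleaner()

# Sample input (made up): an inline event handler and a <script> element.
dirty = (
    '<html><body><h1 onclick="alert(1)">hi</h1>'
    "<script>alert(2)</script></body></html>"
)

result = cleaner.clean_html(dirty)
print(result)
```

With the other options left at their defaults, the script element and the onclick attribute should still be stripped, while the outer html tag is kept.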
5
I think you need a Cleaner that leaves page_structure alone:
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(page_structure=False)
>>> cleaner.clean_html("<html><body><h1>hi</h1></body></html>")
'<html><body><h1>hi</h1></body></html>'
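As described here, page_structure defaults to True. I suspect the documentation on the site you linked is incorrect or out of date.
Edit #1: A test in the source code confirms that this is the expected behavior. A pull request has been submitted to fix the documentation.
Edit #2: The pull request was merged into master on April 28, 2013.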