Converting an HTML fragment to plain text with html5lib
Is there a simple way to use the Python library html5lib to convert something like this:
<p>Hello World. Greetings from <strong>Mars.</strong></p>
into this:
Hello World. Greetings from Mars.
3 Answers
1
You can join the results of the itertext() method.
For example:
import html5lib
d = html5lib.parseFragment(
'<p>Hello World. Greetings from <strong>Mars.</strong></p>')
s = ''.join(d.itertext())
print(s)
Output:
Hello World. Greetings from Mars.
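
To see why joining itertext() works, here is a minimal sketch that uses the standard library's xml.etree.ElementTree directly. html5lib's default tree builder produces ElementTree-style elements, and this particular fragment happens to be well-formed XML, so ET.fromstring is used here only as a stand-in for the html5lib parse step. That substitution is an assumption for illustration and would not hold for arbitrary HTML.

```python
import xml.etree.ElementTree as ET

# Stand-in for html5lib's output: this fragment is well-formed XML,
# so the standard-library parser can build an equivalent tree.
p = ET.fromstring(
    "<p>Hello World. Greetings from <strong>Mars.</strong></p>")

strong = p[0]
print(repr(p.text))       # text before the first child element
print(repr(strong.text))  # text inside <strong>
print(repr(strong.tail))  # text after </strong> (None in this fragment)

# itertext() walks the tree in document order and yields each piece
# of text (element text and tails), so joining gives the plain text.
plain_text = "".join(p.itertext())
print(plain_text)
```

The last line prints Hello World. Greetings from Mars., the same result as the html5lib version above.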
4
I use html2text, which converts HTML into plain text (in Markdown format).
from html2text import HTML2Text
handler = HTML2Text()
html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
<br><h1>Nullam eget \r\ngravida elit</h1>Integer iaculis elit at risus feugiat:
<br><br><ul><li>Egestas non quis \r\nlorem.</li><li>Nam id lobortis felis.
</li><li>Sed tincidunt nulla.</li></ul>
At massa tempus, quis \r\nvehicula odio laoreet.<br>"""
text = handler.handle(html)
>>> text
u'Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.\n\n \n\n# Nullam eget gravida elit\n\nInteger iaculis elit at risus feugiat:\n\n \n\n * Egestas non quis lorem.\n * Nam id lobortis felis.\n * Sed tincidunt nulla.\nAt massa tempus, quis vehicula odio laoreet.\n\n'
12
Using lxml as the parser backend:
import html5lib
body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
doc = html5lib.parse(body, treebuilder="lxml")
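# html5lib's lxml treebuilder returns an lxml.etree tree, which has no
# text_content() method, so XPath string() is used to collect the text.
print(doc.xpath("string()"))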
Honestly, this is a bit of a cheat, since it is equivalent to the following code (only the relevant parts are changed):
from lxml import html
doc = html.fromstring(body)
print(doc.text_content())
If you really want to use the html5lib parsing engine:
from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))
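
One caveat that applies to itertext(), text_content(), and XPath string() alike: they concatenate text nodes with nothing in between, so adjacent block elements such as two paragraphs run together. The sketch below is one possible way to add line breaks. The text_with_breaks helper and the BLOCK_TAGS set are hypothetical names made up for this illustration, not part of html5lib or lxml, and the input is again parsed with the standard library as a stand-in for real parser output. The helper strips any namespace prefix from tag names because html5lib namespaces HTML elements by default.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately short list of block-level tags.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br"}


def text_with_breaks(element):
    """Collect text in document order, adding a newline after block elements."""
    parts = []

    def walk(node):
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            # Drop a "{namespace}" prefix if present; non-string tags
            # (e.g. comments) are treated as inline.
            tag = child.tag
            local = tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""
            if local in BLOCK_TAGS:
                parts.append("\n")
            if child.tail:
                parts.append(child.tail)

    walk(element)
    return "".join(parts).strip()


root = ET.fromstring(
    "<div><p>Hello <b>World</b>.</p><p>Greetings from Mars.</p></div>")

joined = "".join(root.itertext())   # paragraphs run together
separated = text_with_breaks(root)  # one paragraph per line
print(repr(joined))
print(repr(separated))
```

Here joined is 'Hello World.Greetings from Mars.' while separated is 'Hello World.\nGreetings from Mars.'. If Markdown-style structure is wanted rather than bare line breaks, html2text from the answer above is the more complete tool.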