How do I remove all HTML in Python?

Question

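Is there a way to strip or otherwise process HTML tags with lxml.html instead of BeautifulSoup, since the latter has some XSS security issues? I tried Cleaner, but I want to remove all of the HTML.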

Answer 1

This approach uses lxml's cleaning features but avoids having the result wrapped in an HTML element.

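import lxml.html
import lxml.html.clean

# html_text holds the markup to strip (naming it `str` would shadow the built-in)
doc = lxml.html.document_fromstring(html_text)
# allow_tags=[''] matches no real tag; remove_unknown_tags must be False when allow_tags is set
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()

Or as a one-liner:

text = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False).clean_html(lxml.html.document_fromstring(html_text)).text_content()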

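It works by manually parsing the HTML into a document object and passing that object to the Cleaner class. Because the input is an object, clean_html also returns an object rather than a string. The text can then be extracted with text_content(), with no extra wrapper element around it.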
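The steps above can be wrapped in a small helper. This is only a sketch: the function name strip_html and the sample markup are made up for illustration, and it assumes lxml.html.clean is importable (recent lxml releases ship it as the separate lxml_html_clean package).

```python
import lxml.html
from lxml.html.clean import Cleaner  # on recent lxml this needs the lxml_html_clean package


def strip_html(markup):
    """Return the text of `markup` with every tag removed (hypothetical helper)."""
    # Parse the string ourselves so that clean_html receives, and returns, a document object.
    doc = lxml.html.document_fromstring(markup)
    # allow_tags=[''] matches no real tag; remove_unknown_tags must be False alongside allow_tags.
    cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
    # clean_html works on a copy of the document; text_content() then yields plain text.
    return cleaner.clean_html(doc).text_content()


# Made-up sample input: the default Cleaner settings also drop the <script> element.
sample = "<html><body><p>Hello <b>world</b></p><script>alert(1)</script></body></html>"
text = strip_html(sample)
print(text)
```

Passing the parsed document instead of the raw string is what keeps the return value an element, so text_content() can be called on it directly.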

Answer 2

I think this code will help:

from lxml.html.clean import Cleaner

html_text = "<html><head><title>Hello</title></head><body>Text</body></html>"
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
# Given a string, clean_html returns a string, which may still be wrapped in an element
cleaned_text = cleaner.clean_html(html_text)

Answer 3

Try calling .text_content() on an element, preferably after using lxml.html.clean to strip unwanted content (script tags and so on). For example:

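from lxml import html
from lxml.html.clean import clean_html

# html.parse accepts a filename, file object, or URL and returns an ElementTree
tree = html.parse('http://www.example.com')
# Remove scripts and other unwanted content before extracting text
tree = clean_html(tree)

text = tree.getroot().text_content()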
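To see why cleaning first matters, here is a rough offline sketch that runs on an in-memory string instead of a URL. The sample markup and variable names are invented for illustration, and it again assumes lxml.html.clean is importable (the lxml_html_clean package on recent lxml).

```python
from lxml import html
from lxml.html.clean import clean_html  # on recent lxml this needs the lxml_html_clean package

# Made-up sample input containing a script element.
markup = "<div><script>alert(1)</script><p>Hi</p></div>"

# Without cleaning, text_content() also returns the text inside <script>.
element = html.fromstring(markup)
raw = element.text_content()

# clean_html returns a cleaned copy of the element; with default settings it drops scripts.
cleaned = clean_html(element).text_content()

print(repr(raw))
print(repr(cleaned))
```

The uncleaned result still carries the JavaScript source, while the cleaned one contains only the visible text.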
