Remove all HTML in Python?

User

Reply 1 · edited 2024-05-23 21:35:14

This uses lxml's cleaning functions, but avoids having the result wrapped in an HTML element.

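import lxml.html
import lxml.html.clean  # submodules are not loaded by a bare "import lxml"

# html_text holds the markup to strip (renamed from "str", which shadows the built-in)
doc = lxml.html.document_fromstring(html_text)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()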

Or as a one-liner:

text = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False).clean_html(lxml.html.document_fromstring(html_text)).text_content()

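It works by parsing the HTML manually into a document object and handing that object to the Cleaner class. Because clean_html receives an object rather than a string, it also returns an object rather than a string. The text can then be recovered with the text_content() method, without the wrapping element.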
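
The steps described above can be wrapped in a small helper. This is only a sketch: the name strip_html is made up for illustration, style=True is added on the assumption that the contents of style blocks are unwanted (the answer above does not set it), and the fallback import assumes that newer lxml releases need the separately installed lxml_html_clean package for the cleaner.

```python
import lxml.html

try:
    # Older lxml releases ship the cleaner inside lxml itself.
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: on newer releases the cleaner comes from the
    # separately installed lxml_html_clean package.
    from lxml_html_clean import Cleaner


def strip_html(markup):
    """Return the text of an HTML string with all tags removed (hypothetical helper)."""
    # Parse by hand so that clean_html receives an element, not a string.
    doc = lxml.html.document_fromstring(markup)
    cleaner = Cleaner(
        allow_tags=[''],            # no real tag name matches, so every tag is dropped
        remove_unknown_tags=False,  # must be off when allow_tags is given
        style=True,                 # assumption: also discard <style> contents
    )
    # An element goes in, so an element comes out; text_content() joins its text.
    return cleaner.clean_html(doc).text_content()


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p>"))
```

Whitespace in the source markup is kept as-is, so the result may need a strip() or similar normalization afterwards.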

User

Reply 2 · edited 2024-05-23 21:35:14

I believe this code can help you:

from lxml.html.clean import Cleaner

html_text = "<html><head><title>Hello</title></head><body>Text</body></html>"
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
# A string goes in, so a string comes back (see reply 1 regarding the wrapping element)
cleaned_text = cleaner.clean_html(html_text)

User

Reply 3 · edited 2024-05-23 21:35:14

Try the .text_content() method on the element, preferably after using lxml.html.clean to strip unwanted content (script tags and so on). For example:

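from lxml import html
from lxml.html.clean import clean_html

# parse() returns an ElementTree for the document at this location
tree = html.parse('http://www.example.com')
# clean_html with default settings strips scripts and similar content
tree = clean_html(tree)

# text_content() is an element method, so call it on the root element
text = tree.getroot().text_content()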
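
If lxml is not available, roughly the same idea (keep the text, drop the tags and the contents of script and style blocks) can be sketched with the standard library's html.parser. This is a different technique from the lxml-based answers above, not a drop-in replacement, and the names _TextExtractor and strip_tags are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while outside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(markup):
    """Return the text of an HTML string with all tags removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p><script>alert(1)</script>"))
```

Unlike lxml, this parser does not repair broken markup, so results on badly malformed pages may differ from the lxml approach.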
