Removing the contents of <style>...</style> tags with html5lib or bleach

5 votes

2 answers

1866 views

Asked 2025-04-17 02:59

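I've been using a great library called bleach to strip unwanted HTML.

I have a lot of HTML documents that were pasted from Microsoft Word, and they contain things like this:

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

After running this through bleach (which does not allow the style tag by default), I get:

st1:*{behavior:url(#ieooui) }

That isn't helpful. bleach only seems to offer two options:

Escape the tags;
Strip the tags (but leave their contents in place).

I'm looking for a third option: remove both the tags and their contents.

Is there a way to use bleach or html5lib to remove style tags completely, contents included? The html5lib documentation hasn't been much help either.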


2 Answers

I managed to strip the contents of the tag with a filter, as described here: https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters. The output still contains an empty <style></style> tag, but that is harmless.

from bleach.sanitizer import Cleaner
from bleach.html5lib_shim import Filter

class StyleTagFilter(Filter):
    """
    https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters
    """

    def __iter__(self):
        in_style_tag = False
        for token in Filter.__iter__(self):
            if token["type"] == "StartTag" and token["name"] == "style":
                in_style_tag = True
            elif token["type"] == "EndTag":
                in_style_tag = False
            elif in_style_tag:
                # If we are in a style tag, strip the contents
                token["data"] = ""
            yield token


# You must include "style" in the tags list
cleaner = Cleaner(tags=["div", "style"], strip=True, filters=[StyleTagFilter])
cleaned = cleaner.clean("<div><style>.some_style { font-weight: bold; }</style>Some text</div>")

assert cleaned == "<div><style></style>Some text</div>"
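
If the leftover empty <style></style> does matter, the same idea can be taken one step further by not yielding the style tokens at all. Below is a minimal sketch of that token-skipping logic, written as a plain generator so it can be tried without bleach installed. It assumes tokens are html5lib-treewalker-style dicts with "type" and "name" keys, as in the filter above; `drop_style_tokens` is a hypothetical helper name, not part of bleach, and the token list is hand-written for illustration.

```python
def drop_style_tokens(tokens):
    """Yield tokens, skipping <style> start/end tags and everything between them.

    Assumes html5lib-treewalker-style token dicts (hypothetical helper, not bleach API).
    """
    in_style = False
    for token in tokens:
        is_style = token.get("name") == "style"
        if token["type"] == "StartTag" and is_style:
            in_style = True
            continue  # drop the opening <style> tag itself
        if token["type"] == "EndTag" and is_style:
            in_style = False
            continue  # drop the closing </style> tag itself
        if in_style:
            continue  # drop the CSS text inside the style element
        yield token


# Hand-written tokens standing in for what a treewalker would produce.
tokens = [
    {"type": "StartTag", "name": "div", "data": {}},
    {"type": "StartTag", "name": "style", "data": {}},
    {"type": "Characters", "data": ".some_style { font-weight: bold; }"},
    {"type": "EndTag", "name": "style"},
    {"type": "Characters", "data": "Some text"},
    {"type": "EndTag", "name": "div"},
]
kept = list(drop_style_tokens(tokens))
```

Inside a Filter subclass, `__iter__` could presumably just return `drop_style_tokens(Filter.__iter__(self))`. "style" would still need to be in the allowed tags, since otherwise the sanitizer strips the tags before this filter sees them. I have not checked this against a specific bleach version.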

Answered 2025-04-17 by Python大师


It turns out that lxml is a better tool for this task:

# Note: since lxml 5.2 this module is provided by the separate lxml_html_clean package.
from lxml.html.clean import Cleaner

def clean_word_text(text):
    # The only thing I need Cleaner for is to clear out the contents of
    # <style>...</style> tags
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)
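
Where adding lxml is not an option, the same "drop the tag and its contents" behaviour can be approximated with the standard library as a pre-pass, before the text is handed to bleach. This is a rough sketch using `html.parser`, not lxml's implementation; `StyleStripper` and `strip_style_blocks` are hypothetical names introduced here for illustration.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit the input HTML, omitting <style> elements and their contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":  # tag names arrive lowercased, so <STYLE> matches too
            self.in_style = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: keep them verbatim unless they are style.
        if tag != "style":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style:
            # Character references were decoded by the parser, so re-escape text.
            self.parts.append(escape(data, quote=False))


def strip_style_blocks(html_text):
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


result = strip_style_blocks(
    "<div><STYLE> st1:*{behavior:url(#ieooui) } </STYLE>Some text</div>"
)
# result == "<div>Some text</div>"
```

This sketch also silently drops comments and doctype declarations, which may or may not be what you want for Word pastes. It does no sanitizing of its own, so the result should still go through bleach.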

Answered 2025-04-17 by Python大师
