Removing all JavaScript and style tags from HTML with Python and the lxml module

Asked 2025-04-17 08:36

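I am using the http://lxml.de/ library to parse an HTML document. So far I have worked out how to strip tags from an HTML document, following the question "In lxml, how do I remove tags but keep all the content?". That method removes only the tags and leaves all of the text behind, so the actual script contents are still there. I also found a reference for the lxml.html.clean.Cleaner class at http://lxml.de/api/lxml.html.clean.Cleaner-class.html, but it is confusing, and I am not sure how to use the class to clean a document. Any help, such as a short example, would be great!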

lxml web scraping HTML xml parsing document cleaning javascript removal css removal cleaner class

5 Answers

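You can use the strip_elements function to remove the scripts together with their content, and then the strip_tags function to remove other tags while keeping their text: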

from lxml import etree

etree.strip_elements(fragment, 'script')  # removes each <script> element along with its content
etree.strip_tags(fragment, 'a', 'p')  # and other tags that you want to remove; their text is kept
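
To make the difference between the two calls concrete, here is a minimal sketch on a small, made-up fragment (the sample markup and the variable names are assumptions for illustration, not part of the answer above). One detail worth knowing is that strip_elements also drops the text that follows the removed element unless with_tail=False is passed:

```python
from lxml import etree, html

# Hypothetical sample markup, used only for illustration.
source = (
    '<div>'
    '<script>var x = 1;</script>text after script'
    '<p>Hello <a href="#">world</a></p>'
    '</div>'
)
fragment = html.fromstring(source)

# Drop <script> and <style> subtrees entirely. with_tail=False keeps the
# text that directly follows each removed element.
etree.strip_elements(fragment, 'script', 'style', with_tail=False)

# Remove only the <a> and <p> tags; their text is merged into the parent.
etree.strip_tags(fragment, 'a', 'p')

result = html.tostring(fragment, encoding='unicode')
print(result)
```

The script body is gone, while the text of the stripped p and a tags stays in the surrounding div.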

Answered 2025-04-17 by Python大师


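Here are some examples showing how to remove and parse different kinds of HTML elements from an XML/HTML tree.

Important: it is better not to depend on external libraries and to do everything in "native Python 2/3 code".

Below are some examples implemented in "native" Python: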

import re

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # .*? matches any character, zero or more times (non-greedy)
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # .*? matches any character, zero or more times (non-greedy)
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META ...> tags and variations)
pattern = r'<[ ]*meta.*?>'  # .*? matches any character, zero or more times (non-greedy)
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # .*? matches any character, zero or more times (non-greedy)
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # .*? matches any character, zero or more times (non-greedy)
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

Notes:

re.IGNORECASE # makes the match case-insensitive, so <script>, <SCRIPT> and <Script> all match
re.MULTILINE # only changes how ^ and $ behave; these patterns do not use them, so it is harmless but not required
re.DOTALL # makes '.' match newlines as well, so a tag that spans several lines is matched
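
The effect of the flags can be checked with a small sketch like the one below. The sample text and variable names are made up for illustration; the pattern is the script pattern from the answer above:

```python
import re

# Hypothetical sample input: a mixed-case script block that spans lines.
text = "<p>keep</p>\n<SCRIPT>\nalert('hi');\n</Script>\n<p>also keep</p>"

pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'

# Without DOTALL, '.' stops at a newline, so the multi-line block survives.
no_dotall = re.sub(pattern, '', text, flags=re.IGNORECASE)

# With DOTALL, '.' also matches newlines and the whole block is removed.
with_dotall = re.sub(pattern, '', text, flags=re.IGNORECASE | re.DOTALL)

# Without IGNORECASE, <SCRIPT> does not match the lowercase pattern.
case_sensitive = re.sub(pattern, '', text, flags=re.DOTALL)

print(no_dotall == text)       # True: nothing was removed
print(case_sensitive == text)  # True: nothing was removed
print('alert' in with_dotall)  # False: the script block is gone
```

Note that regular expressions only approximate HTML parsing; unusual markup (for example, the string "</script>" inside a JavaScript string literal) can still confuse these patterns.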

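I tested this on several different HTML files; it runs "fast" and handles newlines!

Note: it also does not depend on beautifulsoup or any other externally downloaded library!

Hope this helps!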

Answered 2025-04-17 by Python大师


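Below is an example that does what you want. For an HTML document, Cleaner is a better general-purpose solution than strip_elements, because in this case you want to remove more than just the <script> tags; you also want to remove certain attributes on other tags, such as onclick=function().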

#!/usr/bin/env python

import lxml.html
# In lxml 5.2 and later this module is provided by the separate lxml_html_clean package
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True # This is True because we want to activate the javascript filter
cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
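
The same cleaner settings can be tried without network access by passing an HTML string to clean_html. This is a minimal sketch: the sample page is made up for illustration, and it assumes lxml.html.clean is importable (with lxml 5.2 or later that means the lxml_html_clean package is installed):

```python
from lxml.html.clean import Cleaner

# Hypothetical sample document, used only for illustration.
page = (
    '<html><head><style>p { color: red; }</style>'
    '<script>alert("hi");</script></head>'
    '<body><p onclick="doSomething()" style="color: blue">Hello</p></body></html>'
)

# Same two filters as above, passed as keyword arguments.
cleaner = Cleaner(javascript=True, style=True)

# Passing a str to clean_html returns a str.
cleaned = cleaner.clean_html(page)
print(cleaned)
```

With these settings the script and style blocks, the onclick handler, and the inline style attribute should all be gone, while the paragraph text remains.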

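You can find the list of options you can set in the lxml.html.clean.Cleaner documentation; some options are simply set to True or False (each has its own default), while others take a list, like this: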

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between "kill" and "remove":

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
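
The kill/remove distinction described above can be seen in a short sketch. The sample markup is made up for illustration, and the import again assumes lxml.html.clean is available (lxml_html_clean installed for lxml 5.2 or later):

```python
from lxml.html.clean import Cleaner

# Hypothetical sample fragment, used only for illustration.
snippet = '<div><h1>Title</h1><p>Hello <b>world</b></p></div>'

cleaner = Cleaner(
    kill_tags=['h1'],   # drop <h1> together with its content
    remove_tags=['p'],  # drop only the <p> tag, keep what is inside it
)

cleaned = cleaner.clean_html(snippet)
print(cleaned)
```

Here the heading text disappears entirely, whereas the paragraph's text and its nested b element are pulled up into the parent div.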

Answered 2025-04-17 by Python大师
