Stripping HTML tags from a string with Beautiful Soup

6 votes

1 answer

9579 views

Asked 2025-04-16 08:26

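Could anyone provide some sample code showing how to use Python's Beautiful Soup to strip all HTML tags from a piece of text, except for a few specific ones?

I want to remove all JavaScript and HTML tags except: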

<a></a>
<b></b>
<i></i>

and also tags like this:

<a onclick=""></a>

Thanks for your help. I couldn't find much about this online.

1 Answer

from bs4 import BeautifulSoup, Tag

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup(doc, 'html.parser')

# Walk every node in the tree and print only the allowed tags.
for tag in soup.descendants:
    if isinstance(tag, Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)

produces

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you only want the text content, change print(tag) to print(tag.string).

If you want to remove attributes such as onclick="" from the a tags, you can do this:

if isinstance(tag, Tag) and tag.name in ('a', 'b', 'i'):
    if tag.name == 'a':
        del tag['onclick']
    print(tag)
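
The loop above only prints the allowed tags; it does not return the original text with everything else stripped out. If that is what you are after, here is a minimal sketch, assuming Beautiful Soup 4 (the bs4 package) and the built-in html.parser. The function name strip_tags, the keep_attrs flag, and the choice to drop script and style elements together with their contents are my own assumptions for illustration, not something from the question.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}           # tags to keep, taken from the question
DROP_WITH_CONTENT = ["script", "style"]  # assumption: remove these along with their contents


def strip_tags(html, keep_attrs=True):
    """Return html with every tag removed except those in ALLOWED_TAGS.

    Hypothetical helper: keep_attrs=False also discards attributes
    (such as onclick) on the tags that are kept.
    """
    soup = BeautifulSoup(html, "html.parser")

    # First pass: remove script/style elements and everything inside them.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()

    # Second pass: find_all(True) returns a list of every remaining tag,
    # so the tree can be modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()     # remove the tag itself but keep its children
        elif not keep_attrs:
            tag.attrs = {}   # keep the tag, discard all of its attributes

    return str(soup)


sample = ('<div><p>Hi <b>there</b> <a href="/x" onclick="go()">link</a></p>'
          '<script>alert(1)</script></div>')
print(strip_tags(sample))
# Hi <b>there</b> <a href="/x" onclick="go()">link</a>
print(strip_tags(sample, keep_attrs=False))
# Hi <b>there</b> <a>link</a>
```

Note that this is a sketch for tidying up markup, not a vetted sanitizer for untrusted input: with keep_attrs=True, an onclick handler on an a tag survives untouched.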

Answered 2025-04-16 by Python大师
