Remove a tag with BeautifulSoup but keep its contents

Asked 2025-04-15 16:11

I currently have code that does roughly this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

However, I don't want to throw away the contents of the invalid tags. How can I get rid of a tag but keep its contents when I call soup.renderContents()?

html-parsing beautifulsoup tag-handling content-preservation

12 Answers

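Although this has already been mentioned in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I find Bleach much nicer than BeautifulSoup for this job.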

import bleach

html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
# tags=[] allows no tags; strip=True drops disallowed tags but keeps their text
clean = bleach.clean(html, tags=[], strip=True)
print(clean)  # Should print: "Bad Ugly Evil()"
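Since the question filters against a whitelist, here is a minimal sketch of the same call with a non-empty `tags` argument. The whitelist contents and the sample markup are assumptions made up for illustration; only `bleach.clean` with its `tags` and `strip` parameters comes from the answer above.

```python
import bleach

# Hypothetical whitelist, standing in for the question's VALID_TAGS.
VALID_TAGS = {"p", "strong"}

html = "<p>Good, <b>bad</b> and <strong>ugly</strong></p>"

# Tags outside the whitelist are removed, but their text is kept.
cleaned = bleach.clean(html, tags=VALID_TAGS, strip=True)
print(cleaned)
```

Without `strip=True`, Bleach escapes disallowed tags instead of removing them, so the flag is what gives the "drop the tag, keep the content" behaviour.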

Answered 2025-04-15 by Python大师

Current versions of the BeautifulSoup library have a sparsely documented method on Tag objects called replaceWithChildren(). You can use it like this:

from bs4 import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html, "html.parser")
for tag in invalid_tags:
    for match in soup.find_all(tag):
        # In bs4 this name is a deprecated alias of unwrap()
        match.replaceWithChildren()
print(soup)

It seems to behave exactly the way you want, and the code is fairly straightforward (it makes several passes over the DOM, but that could easily be optimized).
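As a rough sketch of the single-pass optimization hinted at here, the loop can walk every tag once and check it against a whitelist, as the question does. This assumes bs4, where the method is named `unwrap()`; the `unwrap_invalid` helper and the whitelist contents are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist, standing in for the question's VALID_TAGS.
VALID_TAGS = {"p", "strong", "em"}

def unwrap_invalid(html, valid_tags=VALID_TAGS):
    """Remove tags not in valid_tags while keeping their children."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, collected up front,
    # so unwrapping while looping over it is safe.
    for tag in soup.find_all(True):
        if tag.name not in valid_tags:
            tag.unwrap()
    return str(soup)

cleaned = unwrap_invalid("<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>")
print(cleaned)
```

Because `unwrap()` moves the children into the parent rather than discarding them, nested invalid tags are still reachable when the loop gets to them.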

Answered 2025-04-15 by Python大师

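The strategy I used is to replace a tag with its contents if they are of type NavigableString. If they aren't, I recurse into them and replace their contents with NavigableString, and so on. Try this: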

from bs4 import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            s = ""

            # Build the tag's text, recursing into any child tags
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(str(c), invalid_tags)
                s += str(c)

            tag.replace_with(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print(strip_tags(html, invalid_tags))

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
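A variant of the same recursive idea, sketched here with bs4 and working directly on the parse tree instead of rebuilding strings, so that valid tags nested inside an invalid one stay as tags. The function name `strip_tags_in_place` and the sample markup are assumptions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def strip_tags_in_place(node, invalid_tags):
    """Recursively unwrap every descendant tag whose name is invalid."""
    # Copy the child list first, because unwrap() changes node.contents.
    for child in list(node.contents):
        if isinstance(child, Tag):
            # Depth first: clean the subtree before touching the tag itself.
            strip_tags_in_place(child, invalid_tags)
            if child.name in invalid_tags:
                child.unwrap()

soup = BeautifulSoup("<b>keep <em>this</em></b> too", "html.parser")
strip_tags_in_place(soup, {"b"})
result = str(soup)
print(result)
```

Here the `<em>` element inside the removed `<b>` remains a real element in the tree rather than being flattened into text.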

Answered 2025-04-15 by Python大师
