Wrapping subsections of text with tags in BeautifulSoup

6 votes

2 answers

2301 views

Asked on 2025-04-18 00:14

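I'm looking for the BeautifulSoup equivalent of this jQuery question.

I want to find a particular regex match within the text of a BeautifulSoup document and replace that part of the text with a wrapped version. I can already do the wrapping with plain text: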

# replace all words ending in "ug" wrapped in quotes,
# with "ug" replaced with "ook"

>>> soup = BeautifulSoup("Snug as a bug in a rug")
>>> soup
<html><body><p>Snug as a bug in a rug</p></body></html>
>>> for text in soup.findAll(text=True):
...   if re.search(r'ug\b',text):
...     text.replaceWith(re.sub(r'(\w*)ug\b',r'"\1ook"',text))
...
u'Snug as a bug in a rug'
>>> soup
<html><body><p>"Snook" as a "book" in a "rook"</p></body></html>

But what if I want bold tags instead of quotes? For example, the desired result would be:

<html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>


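2 Answers

Here is one way to do it. Use a regular expression to build new HTML in which the words to be bolded are wrapped, pass that HTML to the BeautifulSoup constructor, and then replace the entire original parent p tag with the new p tag.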

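import re

import bs4

# The lxml parser wraps the bare text in <html><body><p>...</p></body></html>,
# which provides the parent <p> that gets replaced below.
soup = bs4.BeautifulSoup("Snug as a bug in a rug", "lxml")
print(soup)

for text in soup.find_all(string=True):
    if re.search(r'ug\b', text):
        # Build a replacement <p> whose matching words are wrapped in <b>.
        new_html = "<p>" + re.sub(r'(\w*)ug\b', r'<b>\1ook</b>', text) + "</p>"
        new_soup = bs4.BeautifulSoup(new_html, "html.parser")
        text.parent.replace_with(new_soup.p)

print(soup)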

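Another option is the soup.new_tag method, but that would probably need nested for loops, which is less concise. I'll see whether I can write it up and post it here later.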
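For reference, a soup.new_tag version might look something like the sketch below. This is not the answerer's code: the helper name wrap_matches, its parameters, and the module-level PATTERN are made up for illustration. It splits each matching text node into plain strings and freshly created tags, so nothing is re-parsed as HTML and characters such as < in the text stay plain text.

```python
import re

from bs4 import BeautifulSoup

# Same pattern as in the question: a word ending in "ug", prefix captured.
PATTERN = re.compile(r"(\w*)ug\b")


def wrap_matches(soup, pattern=PATTERN, tag_name="b"):
    """Hypothetical helper: wrap each match in a new tag, with "ug" -> "ook"."""
    # find_all returns a list, so the tree can be modified while looping.
    for text in soup.find_all(string=True):
        pieces = []
        last_end = 0
        for match in pattern.finditer(text):
            # Keep the plain text between the previous match and this one.
            if match.start() > last_end:
                pieces.append(text[last_end:match.start()])
            tag = soup.new_tag(tag_name)
            tag.string = match.group(1) + "ook"
            pieces.append(tag)
            last_end = match.end()
        if not pieces:
            continue  # no match in this text node
        # Keep any trailing text after the last match.
        if last_end < len(text):
            pieces.append(text[last_end:])
        # Put the pieces in front of the original node, then drop the original.
        for piece in pieces:
            text.insert_before(piece)
        text.extract()
    return soup


soup = BeautifulSoup("<p>Snug as a bug in a rug</p>", "html.parser")
wrap_matches(soup)
print(soup)
```

The nested loop is the price of this approach: the outer loop walks the text nodes and the inner one walks the matches inside each node.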

Answered on 2025-04-18 by Python大师


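import re
from bs4 import BeautifulSoup

for text in soup.find_all(string=True):
    if re.search(r'ug\b', text):
        # Parse the substituted string as a fragment and swap it in for the text node.
        fragment = BeautifulSoup(re.sub(r'(\w*)ug\b', r'<b>\1ook</b>', text), 'html.parser')
        text.replace_with(fragment)

soup
Out[117]: <html><body><p><b>Snook</b> as a <b>book</b> in a <b>rook</b></p></body></html>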

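The idea here is that we replace a text node with an entire parse tree. The simplest way to do that is to call BeautifulSoup directly on the string produced by the regex substitution.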

The 'html.parser' argument is slightly magical: it stops BeautifulSoup from automatically adding the <html><body><p> tags, which bs4 (really lxml) normally does. You can read more about this here.
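A quick way to see the parser difference is to parse the same fragment both ways. This is only a sketch: the variable names are illustrative, and the lxml branch is optional because lxml is a separate install; its exact output is not asserted here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

fragment = "<b>Snook</b> as a <b>book</b>"

# html.parser keeps the fragment as written, with no wrapper elements.
parsed = BeautifulSoup(fragment, "html.parser")
print(parsed)

# lxml treats the input as a full document, so wrapper tags are expected.
try:
    print(BeautifulSoup(fragment, "lxml"))
except FeatureNotFound:
    print("lxml is not installed; skipping the comparison")
```

Because the html.parser result has no wrapper, it can be dropped into an existing tree in place of a text node without nesting a second <html> or <body>.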

Answered on 2025-04-18 by Python大师

