Python BeautifulSoup - Adding tags around found keywords

3 votes

2 answers

4364 views

数据工程师

Asked 2025-04-17 14:28

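I'm working on a project where I want to run regular-expression searches across a large set of HTML files.

I have already identified the files I'm interested in, and now I want to highlight the keywords that were found.

With BeautifulSoup I can determine which node a keyword is in. One thing I already do is change the color of the whole parent node.

However, I would also like to wrap the keywords I found in my own <span> tags.

Locating the position with BeautifulSoup's find() function is not hard. But adding my own tags around plain text seems to be impossible?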

# match = keyword found by another regex
# node = the node I found using the soup.find(text=myRE)
node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))

This way I am only adding plain text, not a proper tag, because the document is not re-parsed, and I would like to avoid re-parsing it.
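To make the symptom concrete, here is a minimal sketch using the newer bs4 package (an assumption; the snippet above uses the older setString() method). The sample HTML and the keyword are made up for illustration. Assigning a string that contains markup to .string stores it as text, so the angle brackets are escaped on output and no real element is created:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document and keyword, for illustration only.
soup = BeautifulSoup("<p>some keyword here</p>", "html.parser")
node = soup.find(string=re.compile("keyword"))

# The replacement is stored as a text node, not parsed as markup.
node.parent.string = node.replace("keyword", "<myspan>keyword</myspan>")

print(soup)                 # the angle brackets come out as &lt; and &gt;
print(soup.find("myspan"))  # None: no real <myspan> element exists
```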

I hope my question is reasonably clear :)

Tags: regex, data processing, HTML, beautifulsoup, document parsing, tag insertion, node manipulation, keyword highlighting

2 Answers

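If you build the marked-up text like this...

# Build the marked-up string first. setString() is called for its side
# effect, so its return value should not be relied on.
marked_up = node.replace(match, "<myspan>" + match + "</myspan>")

...and then run it through BeautifulSoup again

new_soup = BeautifulSoup(marked_up)

it should be recognized as a BS tag object that can be navigated like any other parsed markup.

You could also apply these changes to the original large block of text and then parse it as a whole, which avoids repeating the work.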
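A minimal sketch of that idea with the newer bs4 package (an assumption; the answer itself uses the older API). Only the small marked-up fragment is re-parsed, and the resulting nodes take the place of the original text node. The sample HTML, the pattern, and the "highlight" class name are made up for illustration, and the sketch assumes the pattern does not match inside escaped entities such as &amp;:

```python
import html
import re
from bs4 import BeautifulSoup

# Hypothetical sample document and pattern, for illustration only.
soup = BeautifulSoup("<p>find the keyword here</p>", "html.parser")
pattern = re.compile(r"keyword")
node = soup.find(string=pattern)

# Escape the text first so stray '<' or '&' characters survive re-parsing,
# then wrap each match in a span.
escaped = html.escape(str(node), quote=False)
marked_up = pattern.sub(r'<span class="highlight">\g<0></span>', escaped)

# Parse only the small fragment, not the whole document.
fragment = BeautifulSoup(marked_up, "html.parser")

# Move the parsed pieces in front of the old text node, then drop the node.
for piece in list(fragment.contents):
    node.insert_before(piece)
node.extract()

print(soup)
```

The document as a whole is never re-parsed; only the text of the one affected node is.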

Edit:

From the documentation (this example uses the BeautifulSoup 3 API and Python 2 syntax):

# Here is a more complex example that replaces one tag with another: 

from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>
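For readers on the newer bs4 package, a rough equivalent of the documentation example might look like the sketch below. It assumes bs4 with the built-in "html.parser"; new_tag() and replace_with() are the bs4 counterparts of the Tag(...) constructor and replaceWith() used above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>", "html.parser")

# Create a new element that belongs to this soup, then fill it with text.
tag = soup.new_tag("newTag", id="1")
tag.string = "Hooray!"

# Swap the existing <a> element for the new one.
soup.a.replace_with(tag)

print(soup)
```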

Answered 2025-04-17 by Python大师


Here is a simple example that shows one way to do it:

import re
from bs4 import BeautifulSoup as Soup

html = '''
<html><body><p>This is a paragraph</p></body></html>
'''

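(1) Save the text first, then empty the tag

soup = Soup(html, 'html.parser')  # name the parser explicitly
text = soup.p.string
soup.p.clear()
print(soup)

(2) Find the start and end positions of the word to be bolded (excuse my English)

match = re.search(r'\ba\b', text)
start, end = match.start(), match.end()

(3) Split the text and add the first part

soup.p.append(text[:start])
print(soup)

(4) Create a tag, put the matched text inside it, and append it to the parent tag

b = soup.new_tag('b')
b.append(text[start:end])
soup.p.append(b)
print(soup)

(5) Append the rest of the text

soup.p.append(text[end:])
print(soup)

The output of the steps above is: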

<html><body><p></p></body></html>
<html><body><p>This is </p></body></html>
<html><body><p>This is <b>a</b></p></body></html>
<html><body><p>This is <b>a</b> paragraph</p></body></html>
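The five steps above handle a single match in a single tag. As a hedged sketch, they could be generalized to every match in every text node roughly as follows. The helper name highlight_matches and its parameters are hypothetical, not part of BeautifulSoup, and the sketch assumes the bs4 package with the built-in "html.parser":

```python
import re
from bs4 import BeautifulSoup, NavigableString


def highlight_matches(soup, pattern, tag_name="span"):
    """Wrap every match of a compiled regex in a new <tag_name> element."""
    # Snapshot the matching text nodes first, because the tree changes below.
    for node in list(soup.find_all(string=pattern)):
        text = str(node)
        pieces = []
        pos = 0
        for m in pattern.finditer(text):
            if m.end() == m.start():
                continue  # skip zero-length matches
            if m.start() > pos:
                # Plain text between the previous match and this one.
                pieces.append(NavigableString(text[pos:m.start()]))
            wrapper = soup.new_tag(tag_name)
            wrapper.string = m.group(0)
            pieces.append(wrapper)
            pos = m.end()
        if pos < len(text):
            pieces.append(NavigableString(text[pos:]))
        # Put the new pieces where the old text node was, then remove it.
        for piece in pieces:
            node.insert_before(piece)
        node.extract()
    return soup


soup = BeautifulSoup("<p>This is a paragraph</p>", "html.parser")
highlight_matches(soup, re.compile(r"\ba\b"), "b")
print(soup)  # <p>This is <b>a</b> paragraph</p>
```

Note that find_all(string=...) also returns text inside elements such as script and style, so a real implementation would probably want to skip those.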

Answered 2025-04-17 by Python大师
