Finding tags and text in BeautifulSoup

3 votes

1 answer

4324 views

Asked 2025-04-16 23:05

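I'm having trouble writing a findAll query in BeautifulSoup that does what I want. Previously, I used findAll to extract the text from some HTML, essentially stripping out all the tags. For example, if I had: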

<b>Cows</b> are being abducted by aliens according to the
<a href="www.washingtonpost.com>Washington Post</a>.

it would become:

Cows are being abducted by aliens according to the Washington Post.

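I did this with ''.join(html.findAll(text=True)). That worked well until I decided to keep only the <a> tags and strip all the others. So, given the original example, I would want to end up with this: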

Cows are being abducted by aliens according to the
<a href="www.washingtonpost.com>Washington Post</a>.

I initially thought the following would do the trick:

''.join(html.findAll({'a':True}, text=True))

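But it doesn't work, because text=True seems to match only text. What I need is some kind of OR option: I want to find text OR <a> tags. It's important that the tags stay around the text they mark up; I can't have the tags or the text end up out of order.

Any ideas?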


1 Answer

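Note: BeautifulSoup.findAll is a search tool. Its first argument, name, restricts the search to particular tags. A single findAll call cannot select both the text between tags and the <a> tags together with their text, so I came up with the solution below.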
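
To see why one call isn't enough, here is a small sketch of the two searches run separately. It assumes the newer bs4 package on Python 3 (where find_all and string= are the spellings of findAll and text=) and a sample string with the href quote closed; neither assumption comes from the question.

```python
from bs4 import BeautifulSoup

html = ('<b>Cows</b> are being abducted by aliens according to the '
        '<a href="www.washingtonpost.com">Washington Post</a>.')
soup = BeautifulSoup(html, "html.parser")

# Searching by text returns only strings, in document order; the <a> markup is lost.
text_nodes = soup.find_all(string=True)
text_only = "".join(text_nodes)

# Searching by name returns only the matching tags, without the surrounding text.
link_tags = soup.find_all("a")

print(text_only)
print([str(tag) for tag in link_tags])
```

Each result is missing what the other one has, and concatenating the two lists would not restore the original order.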

This solution requires importing Tag from the BeautifulSoup module first.

from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup('<b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com>Washington Post</a>.')
parsed_soup = ''

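We can walk the top level of the parse tree through the contents attribute, which behaves like a list. When an item is a tag and that tag is not <a>, we extract only its text. Otherwise (an <a> tag or a bare string) we take the whole item as a string, markup included. This relies on the parse tree navigation tools.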

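for item in soup.contents:
    if isinstance(item, Tag) and item.name != u'a':
        # Any tag other than <a>: keep only the text inside it.
        parsed_soup += ''.join(item.findAll(text=True))
    else:
        # An <a> tag or a bare string: keep it as-is, markup included.
        parsed_soup += unicode(item)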

The order of the text is preserved.

 >>> parsed_soup
 u'Cows are being abducted by aliens according to the <a href=\'"www.washingtonpost.com\'>Washington Post</a>.'
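
For reference, a rough equivalent for the newer bs4 package on Python 3. This is only a sketch: the answer itself targets BeautifulSoup 3 on Python 2, keep_only_links is a hypothetical helper name, and the sample input has the href quote closed. Instead of looping over soup.contents it calls unwrap() on every tag other than <a>, so it should also cope with an <a> nested inside another tag, which the top-level loop would flatten to plain text.

```python
from bs4 import BeautifulSoup


def keep_only_links(html):
    """Return html with every tag except <a> removed, keeping text order."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag; the result list is built before
    # the tree is modified, so it is safe to unwrap while iterating.
    for tag in soup.find_all(True):
        if tag.name != "a":
            tag.unwrap()  # replace the tag with its own children
    return str(soup)


# Illustrative input (not from the question): the <a> sits inside <i> inside <p>.
nested = ('<p><b>Cows</b> are being abducted by <i>aliens, says the '
          '<a href="www.washingtonpost.com">Washington Post</a></i>.</p>')
print(keep_only_links(nested))
```

Attributes on the <a> tag are left as they are, since unwrap() only touches the tags it is called on.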

Answered 2025-04-16 by Python大师
