How do I select a div by class inside another div with Beautiful Soup?

Question

I have a bunch of <div> tags nested inside other <div> tags:

<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either
</div>

I am using Python and Beautiful Soup to separate these. I need to extract the "bar" class only when it is wrapped inside a <div> of class "foo". Here is my code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
tag = soup.div
for each_div in soup.findAll('div',{'class':'foo'}):
    print(tag["bar"]).encode("utf-8")

I also tried:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
for each_div in soup.findAll('div',{'class':'foo'}):
     print(each_div.findAll('div',{'class':'bar'})).encode("utf-8")

What am I doing wrong? If I could drop the <div> of class "unwanted" from the selection, I would be happy to simply print each <div>.

Tags: data extraction, beautiful soup, nested structure, class selector, html parsing, div tag

1 Answer

24

You can use find_all() to find every <div> element with the class foo, and then call find() on each of those to look for a <div> with the class bar, like this:

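from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for foo in soup.find_all('div', attrs={'class': 'foo'}):
    # find() returns only the first match, or None if there is none.
    bar = foo.find('div', attrs={'class': 'bar'})
    if bar is not None:
        print(bar.text)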

Run it like this:

python3 script.py htmlfile

This yields:

I want this
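
To try the same find_all() plus find() approach without a file on disk, here is a minimal sketch that parses the question's markup from an inline string. The names HTML and first_bar_texts are made up for illustration, and the 'html.parser' backend is an assumption; any installed parser should behave the same on this input.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, inlined so no file is needed.
HTML = """
<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either
</div>
"""


def first_bar_texts(html):
    """Return the text of the first div.bar found inside each div.foo."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for foo in soup.find_all("div", attrs={"class": "foo"}):
        bar = foo.find("div", attrs={"class": "bar"})
        if bar is not None:  # a foo block may contain no bar at all
            texts.append(bar.text)
    return texts


print(first_bar_texts(HTML))
```

The top-level div with class bar is skipped because the search for bar only ever starts from a foo element, never from the whole document.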

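Update: if a foo block may contain more than one <div> with the class bar, the previous script falls short, because find() returns only the first match. You can instead walk the descendants of each foo block and filter them, like this: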

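from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for foo in soup.find_all('div', attrs={'class': 'foo'}):
    # .descendants walks every node under foo, including text nodes.
    for d in foo.descendants:
        # Text nodes have name None, so the class check is skipped for them.
        # class is multi-valued, so an exact match compares against a list.
        if d.name == 'div' and d.get('class', '') == ['bar']:
            print(d.text)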

Example input:

<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
     <div class="bar">Also want this</div>
</div>

This yields:

I want this
Also want this
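
As an alternative to walking descendants by hand, a CSS selector can express "bar inside foo" in a single call. This is a different technique from the loop above, sketched here under the assumption that your bs4 installation includes its usual CSS selector support (select() is part of Beautiful Soup). The names HTML and bar_texts_in_foo are hypothetical.

```python
from bs4 import BeautifulSoup

# The example input above, plus the stray top-level bar from the question.
HTML = """
<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
     <div class="bar">Also want this</div>
</div>
<div class="bar">Don't want this either
</div>
"""


def bar_texts_in_foo(html):
    """Return the text of every div.bar nested anywhere inside a div.foo."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.foo div.bar" matches a div.bar that is a descendant of a div.foo.
    return [tag.get_text() for tag in soup.select("div.foo div.bar")]


for text in bar_texts_in_foo(HTML):
    print(text)
```

One difference worth knowing: the selector matches any <div> whose class list contains bar, whereas the comparison against ['bar'] in the loop above requires bar to be the only class. Use "div.foo > div.bar" if only direct children should match.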

Answered 2025-04-17 by Python大师
