How do I select a div by class inside another div with Beautiful Soup?
I have a bunch of <div> tags nested inside other <div> tags:
<div class="foo">
<div class="bar">I want this</div>
<div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either
</div>
I am using Python and Beautiful Soup to separate these out. I need to extract a <div> with class "bar" only when it is wrapped inside a <div> with class "foo". Here is my code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
tag = soup.div
for each_div in soup.findAll('div',{'class':'foo'}):
    print(tag["bar"]).encode("utf-8")
I have also tried:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
for each_div in soup.findAll('div',{'class':'foo'}):
    print(each_div.findAll('div',{'class':'bar'})).encode("utf-8")
What am I doing wrong? I would be just as happy simply printing each <div> if I could exclude the "unwanted" class from the selection.
1 Answer
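You can use find_all() to find every <div> element with class foo, and then call find() on each of them to look for a <div> with class bar inside it. Note that find() returns only the first match, or None if there is none. Like this: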
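import sys
from bs4 import BeautifulSoup

# Parse the HTML file named on the command line.
with open(sys.argv[1]) as f:
    soup = BeautifulSoup(f, 'html.parser')
for foo in soup.find_all('div', class_='foo'):
    # find() returns only the first match, or None if there is none.
    bar = foo.find('div', class_='bar')
    if bar is not None:
        print(bar.text)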
Run it like this:
python3 script.py htmlfile
That yields:
I want this
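As for what went wrong in the question, here is my reading, shown as a small sketch. Indexing a tag, as in tag["bar"], looks up an HTML attribute named bar rather than a child element, and in Python 3 print() returns None, so chaining .encode() onto it fails. The inline html string below is a stand-in for the file in the question, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the contents of the HTML file from the question.
html = """
<div class="foo">
  <div class="bar">I want this</div>
  <div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either</div>
"""
soup = BeautifulSoup(html, "html.parser")
foo = soup.find("div", class_="foo")

# Indexing a tag reads an HTML attribute, so foo["class"] works,
# while foo["bar"] raises KeyError (there is no attribute named "bar").
foo_classes = foo["class"]
try:
    foo["bar"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# find_all() returns a list-like ResultSet; read .text from each element
# instead of calling .encode() on the value returned by print().
bar_texts = [bar.text for bar in foo.find_all("div", class_="bar")]
print(bar_texts)
```

The <div class="unwanted"> never shows up because only elements with class bar are requested inside each foo, so nothing has to be removed explicitly.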
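Update: if a foo <div> can contain more than one <div> with class bar, the previous script is not enough, because find() returns only the first match. Instead, you can iterate over the descendants of each foo <div> and keep the ones that match, like this: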
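import sys
from bs4 import BeautifulSoup, Tag

# Parse the HTML file named on the command line.
with open(sys.argv[1]) as f:
    soup = BeautifulSoup(f, 'html.parser')
for foo in soup.find_all('div', class_='foo'):
    # .descendants yields tags and text nodes alike; keep only <div> tags
    # whose class list contains 'bar'.
    for d in foo.descendants:
        if isinstance(d, Tag) and d.name == 'div' and 'bar' in d.get('class', []):
            print(d.text)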
Example input:
<div class="foo">
<div class="bar">I want this</div>
<div class="unwanted">Not this</div>
<div class="bar">Also want this</div>
</div>
That yields:
I want this
Also want this
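A possibly simpler alternative, if your Beautiful Soup version supports CSS selectors through select(), is to express the nesting directly in the selector. This is only a sketch: the inline html string stands in for the input file, with one extra top-level bar <div> added to show that it is not matched, and the names nested and direct are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the input file, plus one bar <div> outside any foo <div>.
html = """
<div class="foo">
  <div class="bar">I want this</div>
  <div class="unwanted">Not this</div>
  <div class="bar">Also want this</div>
</div>
<div class="bar">Don't want this either</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.foo div.bar" matches every <div class="bar"> nested anywhere
# inside a <div class="foo">.
nested = [div.get_text(strip=True) for div in soup.select("div.foo div.bar")]

# "div.foo > div.bar" restricts the match to direct children.
direct = [div.get_text(strip=True) for div in soup.select("div.foo > div.bar")]

print(nested)
```

With this input both selectors pick the same two elements. They differ only when a bar <div> sits deeper than one level inside foo.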