Scraping nested tags from a web page with Python

2 votes

2 answers

7995 views

Asked 2025-04-18 00:47

I am scraping a fixed piece of content from a particular website. The content sits inside nested divs, as shown below:

<div class="table-info">
  <div>
    <span>Time</span>
    <div class="overflow-hidden">
      <strong>Full</strong>
    </div>
  </div>
  <div>
    <span>Branch</span>
    <div class="overflow-hidden">
      <strong>IT</strong>
    </div>
  </div>
  <div>
    <span>Type</span>
    <div class="overflow-hidden">
      <strong>Standard</strong>
    </div>
  </div>
  <div>
    <span>contact</span>
    <div class="overflow-hidden">
      <strong>my location</strong>
    </div>
  </div>
</div>

What I want is the content of the strong tag inside the div with class 'overflow-hidden' that follows the span containing the string "Branch". The code I am using is:

from bs4 import BeautifulSoup
import urllib2 
url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)
type = soup.find('div',attrs={"class":"table-info"}).findAll('span')
print type

This gives me all the spans inside the main div 'table-info', so I can use a conditional to pick out the one I need. But if I try to scrape the div inside the span, like this:

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')
print type

I get this error:

AttributeError: 'list' object has no attribute 'find'

Can anyone suggest how I can get the content of the div inside the span? Thanks! I am using python2.7.


2 Answers

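findAll returns a list of Beautiful Soup (BS) elements (a ResultSet, which is a list subclass), whereas find is a method of a single BS object, not of a list of them. That is why you get the error. The first half of your code is fine.

You can do it like this: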

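from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(content, "html.parser")

table = soup.find('div', attrs={"class": "table-info"})
spans = table.findAll('span')
# spans is a list, so index into it to get a single element first
branch_span = spans[1]
# Do your manipulation with branch_span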

Or:

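from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content, "html.parser")

table = soup.find('div', attrs={"class": "table-info"})
spans = table.findAll('span')

for span in spans:
    # Compare case-insensitively and ignore surrounding whitespace
    if span.text.strip().lower() == 'branch':
        # Do your manipulation; the div is the span's sibling, not its child
        print span.find_next_sibling('div').find('strong').string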
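
The loop above can be tried without network access. What follows is only a minimal sketch, assuming Python 3 and bs4 with the built-in html.parser. The HTML is a trimmed inline copy of the question's snippet, and `SAMPLE_HTML` and `strong_text_for` are hypothetical names introduced for illustration. Because the div sits next to the span rather than inside it, the sketch steps from the matching span to its sibling div with find_next_sibling.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the real page matches it).
SAMPLE_HTML = """
<div class="table-info">
  <div>
    <span>Time</span>
    <div class="overflow-hidden"><strong>Full</strong></div>
  </div>
  <div>
    <span>Branch</span>
    <div class="overflow-hidden"><strong>IT</strong></div>
  </div>
</div>
"""


def strong_text_for(html, label):
    """Return the <strong> text paired with the span whose text equals label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", attrs={"class": "table-info"})
    if table is None:
        return None
    for span in table.find_all("span"):
        if span.get_text(strip=True).lower() == label.lower():
            # The div is the span's sibling, not its child.
            sibling = span.find_next_sibling("div", class_="overflow-hidden")
            strong = sibling.find("strong") if sibling is not None else None
            return strong.get_text(strip=True) if strong is not None else None
    return None


print(strong_text_for(SAMPLE_HTML, "Branch"))  # prints: IT
```

Returning None for a missing label or a missing tag keeps the caller from hitting an AttributeError when the page layout differs from the sample.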

Answered 2025-04-18 by Python大师


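It looks like you want the content of the second div inside the div named "table-info". However, the tag you are searching on has nothing to do with the content you want to reach.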

 type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')

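This raises an error because findAll returns a list, which has no find method. Even on a single span, find('div') would come back empty, since the div is a sibling of the span, not a child of it.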

Try this instead:

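from bs4 import BeautifulSoup
import urllib2
url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
# findAll('div') returns every descendant div in document order
divs = soup.find('div', attrs={"class": "table-info"}).findAll('div')
# Index 2 is the wrapper div of the Branch row
print divs[2].find('strong').string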
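
Index 2 lands on the Branch row because findAll('div') walks all descendant divs in document order, so the unclassed row divs and their 'overflow-hidden' children alternate. The following is a minimal sketch of that ordering, assuming Python 3, bs4 with the built-in html.parser, and a trimmed inline copy of the question's markup. `SAMPLE_HTML` and the "wrapper" label are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the real page matches it).
SAMPLE_HTML = """
<div class="table-info">
  <div>
    <span>Time</span>
    <div class="overflow-hidden"><strong>Full</strong></div>
  </div>
  <div>
    <span>Branch</span>
    <div class="overflow-hidden"><strong>IT</strong></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find("div", attrs={"class": "table-info"}).find_all("div")

# Label each div by its class, or "wrapper" for the unclassed row divs.
kinds = [" ".join(d.get("class", ["wrapper"])) for d in divs]
print(kinds)  # ['wrapper', 'overflow-hidden', 'wrapper', 'overflow-hidden']

# A span has no div child, so find returns None on it.
first_span = soup.find("span")
print(first_span.find("div"))  # None

# Index 2 is the second wrapper, i.e. the Branch row.
print(divs[2].find("strong").string)  # IT
```

Positional indexing stops working if the page adds or reorders rows, so matching on the span text is usually the sturdier choice when the layout may change.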

Answered 2025-04-18 by Python大师
