Parsing an HTML page with BeautifulSoup

Asked 2025-04-17 16:06

I have started using BeautifulSoup to parse HTML pages.
For example, for the page "http://en.wikipedia.org/wiki/PLCB1":

import sys
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2

# Raise Python's recursion limit before parsing
sys.setrecursionlimit(10000)

site = "http://en.wikipedia.org/wiki/PLCB1"
hdr = {'User-Agent': 'Mozilla/5.0'}  # custom User-Agent header
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

# Print the .string of every header cell in the infobox table
table = soup.find('table', {'class': 'infobox'})
rows = table.findAll("th")
for x in rows:
    print "x - ", x.string

In some cases the output for a th tag is None, especially when the tag contains a link. Why is that?

Output:

x -  Phospholipase C, beta 1 (phosphoinositide-specific)
x -  Identifiers
x -  None
x -  External IDs
x -  None
x -  None
x -  Molecular function
x -  Cellular component
x -  Biological process
x -  RNA expression pattern
x -  Orthologs
x -  Species
x -  None
x -  None
x -  None
x -  RefSeq (mRNA)
x -  RefSeq (protein)
x -  Location (UCSC)
x -  None

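For example, after "Location" there is another th tag that contains "pubmed search", yet its output is None. I would like to know why.

Second question: is there a way to put each th and its corresponding td into a dictionary, so that the data is easier to work with?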


2 Answers

If you look at the HTML source,

<th colspan="4" style="text-align:center; background-color: #ddd">Identifiers</th>
</tr>
<tr class="">
<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th>
<td colspan="3" class="" style="background-color: #eee"><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.genenames.org/data/hgnc_data.php?hgnc_id=15917">PLCB1</a>; EIEE12; PI-PLC; PLC-154; PLC-I; PLC154; PLCB1A; PLCB1B</span></td>
</tr>
<tr class="">
<th style="background-color: #c3fdb8">External IDs</th>

you will see that between Identifiers and External IDs there is a <th> tag that has no text of its own, only an <a> tag:

<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th>

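This <th> has no text node of its own; its only child is the <a> element. With BeautifulSoup 3, as used in the question, x.string is therefore None.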
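
The behaviour can be checked on a small fragment without fetching the page. The sketch below is only an illustration and makes two assumptions: it uses BeautifulSoup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 from the question, and the HTML is a made-up fragment, not the live infobox. Note that BeautifulSoup 4 is more lenient than version 3: when a tag's only child is another tag, .string falls through to that child, so None appears only when the cell has several children.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment loosely modelled on the infobox rows above.
html = """
<table>
<tr><th>External IDs</th></tr>
<tr><th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th></tr>
<tr><th>Mixed <a href="#">link</a> text</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect .string for every header cell.
strings = [th.string for th in soup.find_all("th")]

for s in strings:
    print("x - ", s)
```

In this sketch the first two cells yield their text, while the third cell, which mixes plain text with a link, yields None.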

Answered 2025-04-17 by Python大师


Element.string holds a value only when the text sits directly inside the element. Text inside nested elements is not included.

If you are using BeautifulSoup 4, use Element.stripped_strings instead:

print ''.join(x.stripped_strings)

If you are using BeautifulSoup 3, you need to search for all text elements:

print ''.join([unicode(t).strip() for t in x.findAll(text=True)])
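
As a rough illustration of the BeautifulSoup 4 variant, here is a sketch on Python 3. The header cell is a hypothetical one built to resemble the "pubmed search" case, and the use of get_text() with a separator is an addition of this example, not part of the answer above. It shows that joining with an empty string runs adjacent pieces together, so a space separator may be preferable.

```python
from bs4 import BeautifulSoup

# Hypothetical header cell: a link followed by plain text.
html = '<th> <a href="/wiki/PubMed">PubMed</a> search </th>'
th = BeautifulSoup(html, "html.parser").th

# stripped_strings yields each text node stripped, skipping empty ones.
joined = "".join(th.stripped_strings)   # pieces run together
spaced = " ".join(th.stripped_strings)  # pieces separated by a space

# get_text() with a separator and strip=True gives the same spaced result.
text = th.get_text(" ", strip=True)

print(joined)
print(spaced)
print(text)
```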

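If you want to combine the <th> and <td> elements into a dictionary, loop over all <th> elements, use .findNextSibling() to find the corresponding <td> element, and combine that with the .findAll(text=True) approach above to build the dictionary: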

info = {}
rows = table.findAll("th")
for headercell in rows:
    valuecell = headercell.findNextSibling('td')
    if valuecell is None:
        continue
    header = ''.join([unicode(t).strip() for t in headercell.findAll(text=True)])
    value = ''.join([unicode(t).strip() for t in valuecell.findAll(text=True)])
    info[header] = value
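
For readers on Python 3, the same idea might look like the sketch below under BeautifulSoup 4. It assumes the bs4 package, a simplified made-up infobox (the "placeholder id" value is invented for the example), and a hypothetical helper name, infobox_to_dict; the method names find_all and find_next_sibling are the BeautifulSoup 4 spellings of findAll and findNextSibling.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified infobox; not the live Wikipedia markup.
html = """
<table class="infobox">
<tr><th colspan="2">Identifiers</th></tr>
<tr><th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th>
    <td>PLCB1; EIEE12; PI-PLC</td></tr>
<tr><th>External IDs</th><td><a href="#">placeholder id</a></td></tr>
</table>
"""


def infobox_to_dict(table):
    """Map each header cell's text to the text of the <td> that follows it."""
    info = {}
    for header_cell in table.find_all("th"):
        value_cell = header_cell.find_next_sibling("td")
        if value_cell is None:
            continue  # header-only rows (section titles) have no value cell
        header = header_cell.get_text(" ", strip=True)
        info[header] = value_cell.get_text(" ", strip=True)
    return info


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="infobox")
info = infobox_to_dict(table)
print(info)
```

Header cells that span the whole row, such as "Identifiers" here, are skipped because they have no sibling <td>.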

Answered 2025-04-17 by Python大师

