BeautifulSoup HTML table parsing

18 votes

2 answers

23575 views

Asked 2025-04-15 17:59

I am trying to extract information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

I am currently using BeautifulSoup, and this is my code:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table")

rows = table.findAll('tr')[3]

cols = rows.findAll('td')

roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string

entry = (roadtype, start, end, condition, reason, update)

print entry

The problem is with the start and end columns. Both come out as "None".

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

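I know the values are stored in the cols list, but it seems the extra link tag is interfering with the parsing of the original HTML, which looks like this: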

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

Any suggestions or help would be greatly appreciated. Thanks in advance.

2 Answers

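I tried to reproduce your error, but the content of the source page has changed.

I ran into a similar problem before; an example that reproduces it can be found here.

I replaced the original URL with a table from Wikipedia.

I solved the problem by switching to BeautifulSoup4: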

from bs4 import BeautifulSoup

and changing .string to .get_text():

start = cols[1].get_text()

I could not test this against your example (as I said, I could not reproduce the error), but I think it will help people who are looking for a solution.
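To make the difference between .string and .get_text() concrete, here is a minimal sketch for Python 3 with bs4 and the built-in "html.parser". The first two cells are copied from the question; the third cell (headers="note") is a hypothetical one I added to show a cell with more than one child. As far as I understand it, bs4 follows a single child tag when resolving .string, whereas the older BeautifulSoup 3 did not, which would match the None values reported in the question.

```python
from bs4 import BeautifulSoup

# Two cells from the question plus one hypothetical cell with several children.
ROW_HTML = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="note"><a href="#">Cabin Ln</a> <b>(hypothetical)</b></td>
</tr></table>
"""

soup = BeautifulSoup(ROW_HTML, "html.parser")
cols = soup.find("tr").find_all("td")

plain = cols[0].string           # the cell's only child is a text node
linked = cols[1].string          # the only child is <a>; bs4 uses the <a>'s string
mixed = cols[2].string           # several children, so .string is None
mixed_text = cols[2].get_text()  # get_text() joins every text node in the cell

print(repr(plain))
print(repr(linked))
print(repr(mixed))
print(repr(mixed_text))
```

So .get_text() is the safer choice whenever a cell may contain links, line breaks, or other nested markup.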

Answered 2025-04-15 by Python大师

start = cols[1].find('a').string

or, more simply,

start = cols[1].a.string

or, better,

start = str(cols[1].find(text=True))

and, for the whole row,

entry = [str(col.find(text=True)) for col in cols]
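For anyone on Python 3, the same idea can be sketched with bs4. This is only an illustration under a few assumptions: the row HTML is pasted from the question rather than fetched from the live page, row_to_entry is a hypothetical helper name, and I use the string=True argument, which newer bs4 releases prefer over text=True.

```python
from bs4 import BeautifulSoup

# The table row quoted in the question, wrapped in a minimal <table>.
ROW_HTML = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
</tr></table>
"""


def row_to_entry(row):
    """Return the first text node of every <td> in the row as a tuple of str."""
    values = []
    for cell in row.find_all("td"):
        first_text = cell.find(string=True)  # searches nested tags such as <a>
        values.append("" if first_text is None else str(first_text))
    return tuple(values)


soup = BeautifulSoup(ROW_HTML, "html.parser")
entry = row_to_entry(soup.find("tr"))
print(entry)
```

Empty cells come back as an empty string here instead of the text "None", which is a design choice of this sketch and not something from the original code.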

Answered 2025-04-15 by Python大师
