I'm trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1
Currently I'm using BeautifulSoup, and my code looks like this:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table")
rows = table.findAll('tr')[3]
cols = rows.findAll('td')
roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string
entry = (roadtype, start, end, condition, reason, update)
print entry
The problem is with the start and end columns. They just get printed as "None".
Output:
(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')
I know they get stored in the columns list, but it seems the extra link tag interferes with the parsing. The original HTML looks like this:
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
So what should be printed is:
(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')
Any suggestions or help would be appreciated. Thanks in advance.
I tried to reproduce your error, but the source HTML page has changed.
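As for the error, I ran into a similar problem while trying to reproduce the example here, with the suggested URL changed to a Wikipedia table.
I fixed it by moving to BeautifulSoup4 and changing .string to .get_text().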
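I couldn't test it with your example (as I said before, I can't reproduce the error), but I think it will be useful to people looking for a solution to this problem.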
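A minimal sketch of that change, assuming Python 3 and the `bs4` package (BeautifulSoup4) with the built-in `html.parser` backend. The row is the sample HTML quoted in the question, wrapped in a `<table>` for the example. The `mixed` cell at the end is hypothetical and is only there to show a case where `.string` still returns None under BeautifulSoup4.

```python
from bs4 import BeautifulSoup

# Sample row copied from the question, wrapped in a <table> for this sketch.
html = """
<table>
<tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("table").find_all("tr")[0]
cols = row.find_all("td")

# get_text() collects all text inside a cell, including text nested in <a>.
entry = tuple(col.get_text().strip() for col in cols)
print(entry)

# Hypothetical cell that mixes plain text and a link (not from the question).
mixed = BeautifulSoup('<td>Near <a href="#">Main St</a></td>', "html.parser").td
print(mixed.string)      # None, because the cell has more than one child
print(mixed.get_text())  # Near Main St
```

Under BeautifulSoup4, `.string` follows a single nested child tag, so it would also return the link text for the start and end cells here. `get_text()` is the more robust choice because it still works when a cell contains several children.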
Or, more simply:
Or better:
And: