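Using BeautifulSoup to scrape details from a remote site for display on a local site
I'm new to Python and BeautifulSoup, and I'm trying to scrape race details from a website so that I can display them on my local club's website.
Here is the code I have so far: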
import urllib2
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
from BeautifulSoup import BeautifulSoup
# Road
#cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Road%20Events'
# MTB
cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Mountain%20Biking%20Events'
response = urllib2.urlopen(cyclelab_url)
html = response.read()
soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print txt
event_details = soup.findAll(attrs= {"class" : "TDText"})
for detail in event_details:
    lines=[]
    txt_details = detail.find(text=True)
    print txt_details
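This code prints the race names and the details, but what I want is to print each race name followed by that race's details underneath it. It seems like this should be simple, but I'm stuck.
2 Answers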
0
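Update: Mark Longair has given a more correct answer. See the comments.
Code runs from top to bottom, so your code first prints all the events and only then prints the details. You need to "weave" the two loops together: for each event, print all of its details before moving on to the next event. Try something like this: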
[....]
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
event_details = soup.findAll(attrs= {"class" : "TDText"})
for event in event_names:
    txt = event.find(text=True)
    print txt
    for detail in event_details:
        txt_details = detail.find(text=True)
        print txt_details
A further improvement: you can use .strip() to remove leading and trailing whitespace, including newlines. For example: txt_details = detail.find(text=True).strip()
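To make the .strip() suggestion concrete, here is a minimal sketch. The sample string and the clean() helper are made up for illustration and are not part of the original code. It shows that strip() only trims the ends of the string, and it guards against the case where find(text=True) returns None because the element has no text:

```python
def clean(text):
    """Return text with outer whitespace removed, or "" if text is None."""
    if text is None:
        return ""
    return text.strip()


# Hypothetical scraped text node with surrounding and internal whitespace.
raw = "\r\n   Columbia   Grape Escape 2011 \n"

# strip() removes only leading and trailing whitespace.
stripped = clean(raw)

# split() with no arguments splits on any run of whitespace, so joining the
# pieces with a single space also collapses the internal runs.
collapsed = " ".join(raw.split())

print(repr(stripped))
print(repr(collapsed))
print(repr(clean(None)))
```

If the scraped text has irregular spacing inside it, the join/split form is probably the one you want. Otherwise plain strip() is enough.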
4
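If you look at the structure of the page, you'll see that each event name found by the first loop is enclosed by a table, and that table holds all the other useful details as pairs of cells arranged in rows. So what I would do is use just one loop: each time you find an event name, look for the table that encloses it and then find all the details under that table. This seems to work OK: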
soup = BeautifulSoup(html)
event_names = soup.findAll(attrs= {"class" : "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print "Event name: "+txt.strip()
    # Find each parent in turn until we find the table that encloses
    # the event details:
    parent = event.parent
    while parent and parent.name != "table":
        parent = parent.parent
    if not parent:
        raise Exception("Failed to find a <table> enclosing the event")
    # Now parent is the table element, so look for every
    # row under that table, and then the cells under that:
    for row in parent.findAll('tr'):
        cells = row.findAll('td')
        # We only care about the rows where there is a multiple of two
        # cells, since these are the key / value pairs:
        if len(cells) % 2 != 0:
            continue
        for i in xrange(0,len(cells),2):
            key_text = cells[i].find(text=True)
            value_text = cells[i+1].find(text=True)
            if key_text and value_text:
                print " Key:",key_text.strip()
                print " Value:",value_text.strip()
The output looks like this:
Event name: Columbia Grape Escape 2011
Key: Category:
Value: Mountain Biking Events
Key: Event Date:
Value: 4 March 2011 to 6 March 2011
Key: Entries Close:
Value: 31 January 2011 at 23:00
Key: Venue:
Value: Eden on the Bay, Blouberg
Key: Province:
Value: Western Cape
Key: Distance:
Value: 3 Day, 3 Stage Race (228km)
Key: Starting Time:
Value: -1:-1
Key: Timed By:
Value: RaceTec
Event name: Investpro MTB Race 2011
Key: Category:
Value: Mountain Biking Events
Key: Event Date:
Value: 5 March 2011
Key: Entries Close:
Value: 25 February 2011 at 23:00
... and so on.
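If you want to store the details rather than print them, the key/value pairing step can be pulled out into a small function. This is only a sketch of that one step, using plain lists of strings in place of the text of each td cell. The name pair_cells and the sample rows are assumptions for illustration, not anything taken from the page or from BeautifulSoup:

```python
def pair_cells(cells):
    """Group a flat row of cell texts into (key, value) tuples.

    Rows with an odd number of cells are treated as not being key/value
    rows and produce an empty list.
    """
    if len(cells) % 2 != 0:
        return []
    pairs = []
    for i in range(0, len(cells), 2):
        key, value = cells[i], cells[i + 1]
        # Skip pairs where either cell had no text at all.
        if key and value:
            pairs.append((key.strip(), value.strip()))
    return pairs


# Hypothetical rows, standing in for the text found in each <td> of a <tr>.
rows = [
    ["Category:", "Mountain Biking Events", "Event Date:", " 5 March 2011 "],
    ["a lone header cell"],
    ["Venue:", None],
]

details = [pair for row in rows for pair in pair_cells(row)]
for key, value in details:
    print("%s %s" % (key, value))
```

Collecting the pairs into a list (or a dict per event) should make it easier to hand them to whatever template renders the club site, instead of parsing printed output again.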