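Scraping and parsing multiple pages with Beautiful Soup
I want to scrape several pages from a website and then parse them with BeautifulSoup. So far I have tried to do this with urllib2, but I have run into a problem. The code I tried is: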
import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]

first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]

print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
This code only gives me
3 Answers
1
I think the indentation inside your loop is off:
import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    # Everything below is indented, so it runs once per page id.
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)

    title = soup.find("span", {"class":"paperstitle"})
    date = soup.find("span", {"class":"docdate"})
    span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit
    paras = [x for x in span.findAllNext("p")]

    first = title.string
    second = date.string
    start = span.string
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
    last = paras[-1].contents[0]

    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
I think that should solve the problem.
1
Here is a tidier solution (using the lxml library):
import lxml.html as lh

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ['85753', '87433']

def scrape_page(page_id):
    url = root_url + page_id
    tree = lh.parse(url)

    title = tree.xpath("//span[@class='paperstitle']")[0].text
    date = tree.xpath("//span[@class='docdate']")[0].text
    text = tree.xpath("//span[@class='displaytext']")[0].text_content()

    return title, date, text

if __name__ == '__main__':
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)
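The main loop above unpacks each page's title, date, and text but does not keep them. As a minimal sketch, one way to hold on to every page's result is to collect them in a dict keyed by page id. The names `scrape_all` and `fake_scrape_page` are hypothetical and not part of the answer; `fake_scrape_page` is a stand-in that avoids network access, and in practice the answer's `scrape_page` would presumably be passed in its place.

```python
def scrape_all(page_ids, scrape_page):
    """Call scrape_page once per id and keep every result, keyed by page id."""
    results = {}
    for page_id in page_ids:
        # The work happens inside the loop body, so no page is skipped.
        results[page_id] = scrape_page(page_id)
    return results


def fake_scrape_page(page_id):
    # Hypothetical stand-in for the answer's scrape_page(): returns dummy
    # (title, date, text) values instead of fetching a real page.
    return ("title " + page_id, "date " + page_id, "text " + page_id)


results = scrape_all(['85753', '87433'], fake_scrape_page)
for page_id, (title, date, text) in results.items():
    print(page_id, title, date)
```

Passing the scraping function as an argument keeps the collecting loop separate from the fetching code, which also makes it easy to try out without hitting the site.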
1
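You need to put the rest of the code inside the loop. Right now you are iterating over both items in the tuple, but when the loop finishes only the last item is still assigned to address, and that single address is what the code outside the loop goes on to fetch and parse.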
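The difference can be seen without any network access. In this sketch, `handle` is a hypothetical stand-in for the fetch-and-parse steps; it only records which address it was given.

```python
base = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ('85753', '87433')


def handle(address, seen):
    # Hypothetical stand-in for urlopen + BeautifulSoup parsing.
    seen.append(address)


# Only the assignment is in the loop body: handle() runs once, after the loop,
# and sees whatever value address was left holding.
seen_outside = []
for numb in page_ids:
    address = base + numb
handle(address, seen_outside)

# The whole body is in the loop: handle() runs once per page id.
seen_inside = []
for numb in page_ids:
    address = base + numb
    handle(address, seen_inside)

print(len(seen_outside), len(seen_inside))
```

In the first loop the variable keeps being overwritten, so the later call only ever sees the address built from the last id in the tuple.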