Beautiful Soup and while statements
I'm trying to find the first 30 TED videos (video name and URL) with the following BeautifulSoup script:
import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
    soup = BeautifulSoup(page)
    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))
    outfile = open("test.html", "w")
    print >> outfile, """<html>
<head>
<title>TED Talks Index</title>
</head>
<body>
<br><br><center>
<table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""
    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"
    ted_link = 'http://www.ted.com/'
    for anchor in link:
        print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])
        count = count + 1
    print >> outfile, """</table>
</body>
</html>"""
    page_count = page_count + 1
This code looks about right, but there are two problems:

The counter doesn't seem to increment. It only finds the content of the first page — the first ten videos instead of thirty. Why is that?

This part is giving me lots of errors. I don't know how to express what I want logically (using urlopen("%s%d")):
Code:
total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
1 Answer
First, simplify the loop and drop some of the variables, which are just dead weight here.
for pagenum in xrange(1, 4):  # The 4 is annoying, write it as 3+1 if you like.
    url = "http://www.ted.com/talks?page=%d" % pagenum
    # do stuff with url
Then, open the file outside the loop instead of reopening it on every iteration. That's why you only saw 10 results: you were looking at results 11-20, not the first ten as you assumed. (You would have seen 21-30, except that the `page_count < total_pages` condition meant only the first two pages were processed.)
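A minimal sketch of that open-inside-the-loop problem, with hypothetical data in place of the scraped pages: mode "w" truncates the file every time it is opened, so only the last iteration's output survives.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.html")

# Wrong: open(..., "w") inside the loop truncates the file on every
# pass, so only the final page's rows remain -- just like seeing only
# one page's worth of talks.
for page in (1, 2, 3):
    out = open(path, "w")
    out.write("page %d\n" % page)
    out.close()
wrong = open(path).read()

# Right: open once before the loop, write everything, close after.
out = open(path, "w")
for page in (1, 2, 3):
    out.write("page %d\n" % page)
out.close()
right = open(path).read()
```

Here `wrong` ends up holding only "page 3\n", while `right` holds all three lines.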
Next, collect all the links in one pass, then write out the results. I've stripped the styling from the HTML to make the code easier to follow; you can use CSS instead, perhaps in an inline <style> element, or add it back if you prefer.
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

links = []
for pagenum in xrange(1, 4):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
    print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
    print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"