Beautiful Soup and the while statement

2 votes
1 answer
740 views
Asked 2025-04-16 16:38

I'm trying to find the first 30 TED videos (video name and URL) with the following BeautifulSoup script:

import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:

    page = urllib2.urlopen("%s%d") %(url, page_count)

    soup = BeautifulSoup(page)

    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))

    outfile = open("test.html", "w")

    print >> outfile, """<head>
            <head>
                    <title>TED Talks Index</title>
            </head>

            <body>

            <br><br><center>

            <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""

    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"

    ted_link = 'http://www.ted.com/'

    for anchor in link:
            print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])

    count = count + 1

    print >> outfile, """</table>
                    </body>
                    </html>"""

    page_count = page_count + 1

The code looks almost right, but there are two problems:

  1. The counter doesn't seem to increment. It only finds the contents of one page, i.e. the first ten videos instead of thirty. Why is that?

  2. This part is giving me lots of errors. I don't know how to implement what I'm after (using urlopen("%s%d")) in a logical way:

Code:

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:

page = urllib2.urlopen("%s%d") %(url, page_count)

1 Answer

1

First, simplify the loop and drop some of the variables, which are just redundant code here.

for pagenum in xrange(1, 4):  # The 4 is annoying, write it as 3+1 if you like.
  url = "http://www.ted.com/talks?page=%d" % pagenum
  # do stuff with url
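
Incidentally, the errors from your original `urlopen("%s%d") % (url, page_count)` come from operator precedence: the closing parenthesis ends the `urlopen()` call, so the `%` operator is applied to the file-like object `urlopen` returns rather than to the string `"%s%d"`. The formatting has to happen before (or inside) the call — a minimal sketch:

```python
url = 'http://www.ted.com/talks?page='
page_count = 1

# Format the string first...
full_url = "%s%d" % (url, page_count)
print(full_url)  # -> http://www.ted.com/talks?page=1

# ...then pass the finished string to urlopen:
# page = urllib2.urlopen(full_url)
```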

Then, open the file outside the loop instead of reopening it on every iteration. That's why you only see 10 results: you're looking at results 11–20, not the first ten as you thought. (It would have been results 21–30, except that your `page_count < total_pages` condition only processes the first two pages.)
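
The truncation is easy to see in isolation — a small sketch (using a temporary file rather than your `test.html`) of what reopening with mode `"w"` inside the loop does:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.html")

for page in (1, 2, 3):
    out = open(path, "w")  # mode "w" truncates: earlier pages are discarded
    out.write("results for page %d\n" % page)
    out.close()

# Only the last page's output survives:
print(open(path).read())  # -> results for page 3
```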

Next, collect all the links in one pass, and only then write out the results. I've stripped the HTML styling to make the code easier to follow; you can use CSS, perhaps in an inline <style> element, or add it back if you like.

import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
  return tag.name == "a" and tag.findParent("dt", "thumbnail")
links = []
for pagenum in xrange(1, 4):
  soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
  links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
  print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
  print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"
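
Note that this answer is written for Python 2: on Python 3, `urllib2`, `print >>`, `xrange`, and `cgi.escape` are all gone. A rough modern equivalent of the escaping step (the part the `# Important!` comment flags) uses `html.escape` — a sketch with a made-up title:

```python
from html import escape  # Python 3 replacement for the removed cgi.escape

# Hypothetical title containing characters that would break the HTML:
title = 'The "best" talk & more'
print(escape(title))  # -> The &quot;best&quot; talk &amp; more

# html.escape quotes '"' by default, so it is safe in attribute values too:
href = '/talks?page=1&sort=newest'
print('<a href="http://www.ted.com%s">' % escape(href))
```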
