Scraping multiple pages for parsing with Beautiful Soup

0 votes
3 answers
4576 views
Asked 2025-04-17 07:25

I want to scrape multiple pages from a site and parse them with BeautifulSoup. So far I have been trying to do this with urllib2, but I am running into problems. Here is what I have tried:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]

first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]

print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)

This only gives me the result for the second number in the sequence, i.e. this link: http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I also made some attempts with mechanize, with no luck. Ideally what I would like is a page with a list of links, from which the script picks one link, hands the HTML off to BeautifulSoup, and then moves on to the next link.
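
Something along these lines is what I have in mind. Just a sketch (listing_url and the pid= filter are hypothetical stand-ins for a real index page):

import urllib2, urlparse
from BeautifulSoup import BeautifulSoup

listing_url = 'http://www.presidency.ucsb.edu/ws/index.php'  # hypothetical index page

listing = BeautifulSoup(urllib2.urlopen(listing_url).read())

for link in listing.findAll('a', href=True):
    href = urlparse.urljoin(listing_url, link['href'])  # resolve relative links
    if 'pid=' not in href:  # hypothetical filter: keep only document links
        continue
    soup = BeautifulSoup(urllib2.urlopen(href).read())
    title = soup.find("span", {"class": "paperstitle"})
    if title is not None:  # skip pages without a title span
        print title.string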

3 Answers

1

I think the indentation inside your loop is off:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)

    title = soup.find("span", {"class":"paperstitle"})
    date = soup.find("span", {"class":"docdate"})
    span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
    paras = [x for x in span.findAllNext("p")]

    first = title.string
    second = date.string
    start = span.string
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
    last = paras[-1].contents[0]

    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)

I think that should take care of the problem.
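
If you would rather keep the results around instead of printing as you go, a small variation on the same loop (just a sketch, same libraries) is to accumulate them in a list:

import urllib2
from BeautifulSoup import BeautifulSoup

results = []
for numb in ('85753', '87433'):
    address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb
    soup = BeautifulSoup(urllib2.urlopen(address).read())
    title = soup.find("span", {"class": "paperstitle"})
    date = soup.find("span", {"class": "docdate"})
    # one (title, date) pair per page, available after the loop ends
    results.append((title.string, date.string))

print results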

1

Here is a tidier solution (using the lxml library):

import lxml.html as lh

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ['85753', '87433']

def scrape_page(page_id):
    url = root_url + page_id
    tree = lh.parse(url)

    title = tree.xpath("//span[@class='paperstitle']")[0].text
    date = tree.xpath("//span[@class='docdate']")[0].text
    text = tree.xpath("//span[@class='displaytext']")[0].text_content()

    return title, date, text

if __name__ == '__main__':
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)
        # do something with the scraped fields; here they are just printed
        print "%s\n%s\n\n%s" % (title, date, text)
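
Note that lxml.html's parse() accepts a URL directly and fetches the page itself, so there is no separate urllib2 call here.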

1

You need to put the rest of the code inside the loop. At the moment you iterate over the two items in the tuple, but once the loop ends only the last one is still bound to address, and everything after the loop therefore processes only that one page.
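
A minimal demonstration of the scoping issue, separate from the scraping itself:

for numb in ('85753', '87433'):
    address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb

# the loop has finished; address keeps only the value from the last iteration
print address  # http://www.presidency.ucsb.edu/ws/index.php?pid=87433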
