Python urllib: skipping links on HTTP error or bad URL

Posted 2024-04-24 14:04:12


How do I modify this script to skip a URL when the connection times out or the URL is invalid/returns a 404?

Python

#!/usr/bin/python

#parser.py: Downloads Bibles and parses all data within <article> tags.

__author__      = "Cody Bouche"
__copyright__   = "Copyright 2012 Digital Bible Society"

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    with open(full_path, 'wb') as out:
        out.write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")

2 Answers

Try putting your urlopen line inside a try/except statement. Look this up:

Section 8.3 of docs.python.org/tutorial/errors.html

Look at the different exceptions, and when you encounter one, restart the loop with the continue statement.
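
For instance, here is a minimal sketch of that approach, assuming Python 2's urllib2 as in the question's script; the urls list below is a hypothetical placeholder, and the parsing/saving logic is elided:

import urllib2
from urllib2 import HTTPError, URLError

urls = ['http://example.com/good', 'http://example.com/missing']  # placeholder list

for url in urls:
    try:
        # timeout (in seconds) guards against hanging connections
        f = urllib2.urlopen(url, timeout=10)
        s = f.read()
    except HTTPError as e:
        # 404 and any other HTTP error status lands here
        print('skipping %s: HTTP %d' % (url, e.code))
        continue  # restart the loop with the next URL
    except URLError as e:
        # timeouts, DNS failures, refused connections, etc.
        print('skipping %s: %s' % (url, e.reason))
        continue
    # ...parse and save s as in the original script...
    print('fetched %s (%d bytes)' % (url, len(s)))

Note that HTTPError is a subclass of URLError, so it must be caught first to get at the status code.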

To apply a timeout to the request, add the timeout argument to your call to urlopen. From the docs:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

See the section on how to handle exceptions with urllib2 in this guide. Actually, I found the whole guide very useful.

The HTTP status code for a request timeout is 408. To sum up, if you want to handle the timeout exception, you would do:

from urllib2 import urlopen, URLError

try:
    response = urlopen(req, timeout=3)  # 3-second timeout; pass it by keyword, since the second positional argument of urlopen is data
except URLError as e:
    if hasattr(e, 'code'):  # only HTTPError instances carry an HTTP status code
        if e.code == 408:
            print 'Timeout ', e.code
        if e.code == 404:
            print 'File Not Found ', e.code
        # etc etc
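
One caveat: a client-side timeout in urllib2 typically surfaces as a URLError whose reason is a socket.timeout (or as a raw socket.timeout raised during read()), not as an HTTPError with code 408, so the hasattr(e, 'code') branch above will not see it. To skip those cases as well, also handle the branch where e has no code attribute, or catch socket.timeout explicitly.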
