How can I modify this script to skip a URL when the connection times out or the URL is invalid (404), in Python?
#!/usr/bin/python
# parser.py: Downloads Bibles and parses all data within <article> tags.
__author__ = "Cody Bouche"
__copyright__ = "Copyright 2012 Digital Bible Society"
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")
Try putting your urlopen call inside a try/except block. Section 8.3 of the Python tutorial (docs.python.org/tutorial/errors.html) covers the different exception types; when one is raised, use a continue statement to skip to the next iteration of the loop. To apply a timeout to the request, pass a timeout argument to the urlopen call. The docs also have a guide with a section on how to handle exceptions with urllib2; I found the whole guide useful. Note that the HTTP status code for a request timeout is 408. To summarize: to handle timeouts and invalid URLs, wrap the urlopen call in try/except and continue on failure.
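A minimal sketch of that pattern. The question's script uses Python 2's urllib2; this sketch assumes Python 3 (urllib.request), and the fetch() helper name is my own, not from the original script:

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the page body as bytes, or None if the URL fails or times out."""
    try:
        # timeout is in seconds; without it a dead server can hang the loop
        with urllib.request.urlopen(url, timeout=timeout) as f:
            return f.read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status, e.g. 404 or 408
        print('skipping %s (HTTP %d)' % (url, e.code))
        return None
    except (urllib.error.URLError, socket.timeout) as e:
        # Bad hostname, connection refused, or timed out
        print('skipping %s (%s)' % (url, e))
        return None


# In the download loop, skip failed URLs instead of crashing:
# for link in root.findall('//a'):
#     s = fetch(link.get('href'))
#     if s is None:
#         continue
#     ...parse s as before...
```

Returning None and testing for it in the loop keeps the continue logic in one place, so every failure mode (timeout, DNS error, 404) is handled the same way.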