How to catch a 404 error in urllib.urlretrieve

30 votes

4 answers

29060 views

Asked 2025-04-15 13:45

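Background: I am using urllib.urlretrieve, rather than any other function in the urllib* modules, because it supports a hook function (see reporthook below) that I use to display a textual progress bar. This applies to Python >= 2.6.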

>>> urllib.urlretrieve(url[, filename[, reporthook[, data]]])

However, urlretrieve is rather dumb: it offers no way to find out the status of the HTTP request (for example, was it a 404 or a 200?).

>>> fn, h = urllib.urlretrieve('http://google.com/foo/bar')
>>> h.items() 
[('date', 'Thu, 20 Aug 2009 20:07:40 GMT'),
 ('expires', '-1'),
 ('content-type', 'text/html; charset=ISO-8859-1'),
 ('server', 'gws'),
 ('cache-control', 'private, max-age=0')]
>>> h.status
''
>>>

What is a good way to download a remote HTTP file that supports both a hook-like callback (to show a progress bar) and decent HTTP error handling?

Tags: error handling, file download, network programming, HTTP request, progress bar, urllib, hook function, 404 error

4 Answers

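The "retrieve" method of the URL opener object supports the report hook and raises an exception on a 404. Note that this holds for the base urllib.URLopener; the FancyURLopener that urlretrieve uses internally returns the error page instead of raising.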

http://docs.python.org/library/urllib.html#url-opener-objects
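
For reference, here is a minimal sketch of what a report hook for a text progress bar might look like. The hook arguments (number of blocks transferred so far, block size in bytes, total size of the file) follow the documented urlretrieve convention; the names format_progress and my_report_hook and the width parameter are made up for illustration. It uses only string formatting and sys.stdout, so it should not depend on the Python version.

```python
import sys


def format_progress(blocknum, blocksize, totalsize, width=40):
    """Build one line of a text progress bar from reporthook arguments."""
    if totalsize <= 0:
        # Size unknown (no usable Content-Length): show the byte count only.
        return "%d bytes" % (blocknum * blocksize)
    # The last block is usually short, so clamp to the total size.
    done = min(blocknum * blocksize, totalsize)
    filled = width * done // totalsize
    percent = 100 * done // totalsize
    return "[%s%s] %3d%%" % ("#" * filled, "-" * (width - filled), percent)


def my_report_hook(blocknum, blocksize, totalsize):
    """Hypothetical hook: redraw the bar in place on a single line."""
    sys.stdout.write("\r" + format_progress(blocknum, blocksize, totalsize))
    sys.stdout.flush()
```

Such a hook could be passed as the reporthook argument of retrieve. Keeping the formatting in a separate function makes it easy to check without downloading anything.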

Answered 2025-04-15 by Python大师

You should use:

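import urllib2

try:
    resp = urllib2.urlopen("http://www.google.com/this-gives-a-404/")
except urllib2.URLError as e:
    # Only HTTPError carries a status code; re-raise anything else
    # (DNS failure, refused connection, ...).
    if not hasattr(e, "code"):
        raise
    # An HTTPError also works as a response object: it has code, msg and read().
    resp = e

print "Gave", resp.code, resp.msg
print "=" * 80
print resp.read(80)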

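Edit: the idea here is that if you did not anticipate an exceptional condition, then its occurrence really is an exception, and you probably never even considered it. So rather than letting your code carry on after a failed request, the default behavior is, quite sensibly, to stop execution.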
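
To get both the progress hook and the error handling, one option is to read the response from urlopen in blocks and call the hook yourself. The sketch below is written for Python 3, where urllib2 was merged into urllib.request and urllib.error, which is an assumption since the question targets Python 2. The function name download and the default block size are made up for illustration; the hook is called with the same three arguments that urlretrieve passes.

```python
import urllib.error
import urllib.request


def download(url, filename, reporthook=None, blocksize=8192):
    """Save url to filename, calling reporthook before and after each block.

    HTTP error statuses such as 404 are not swallowed: urlopen raises
    urllib.error.HTTPError, which the caller can catch and inspect.
    """
    with urllib.request.urlopen(url) as resp:
        # -1 mirrors urlretrieve's convention for an unknown total size.
        total = int(resp.headers.get("Content-Length", -1))
        blocknum = 0
        if reporthook:
            reporthook(blocknum, blocksize, total)
        with open(filename, "wb") as out:
            while True:
                block = resp.read(blocksize)
                if not block:
                    break
                out.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, blocksize, total)
    return filename, resp.headers
```

A caller would wrap the call in try/except urllib.error.HTTPError as e and look at e.code to tell a 404 from other failures, much like the urllib2 snippet above does.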

Answered 2025-04-15 by Python大师

Take a look at the complete code of urllib.urlretrieve:

def urlretrieve(url, filename=None, reporthook=None, data=None):
  global _urlopener
  if not _urlopener:
    _urlopener = FancyURLopener()
  return _urlopener.retrieve(url, filename, reporthook, data)

In other words, you can use urllib.FancyURLopener directly (it is part of the public urllib API). You can override http_error_default to detect 404s:

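import urllib

class MyURLopener(urllib.FancyURLopener):
  def http_error_default(self, url, fp, errcode, errmsg, headers):
    # handle errors the way you'd like to; one option is to raise,
    # as the base URLopener does, so that a 404 no longer passes silently
    fp.close()
    raise IOError('http error', errcode, errmsg, headers)

fn, h = MyURLopener().retrieve(url, reporthook=my_report_hook)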

Answered 2025-04-15 by Python大师
