urllib2: try and except for handling a 404

14 votes

3 answers

53922 views

Asked 2025-04-17 07:01

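I am trying to walk through a series of numbered data pages with urllib2. I would like to use a try statement for this, but I know very little about it. From what I have read, it seems to be based on specific "names" that stand for exceptions, such as IOError. I don't know which error I need to look for, and that is part of my problem.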

I wrote my urllib2 page-fetching routine following "urllib2 - The Missing Manual"; here is the code:

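import os
import sys
import cookielib
import urllib2

# COOKIEFILE is assumed to be defined elsewhere as the path of the cookie file

def fetch_page(url, useragent):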
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()

    txheaders =  {'User-agent' : useragent}

    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
        print "previous cookie loaded..."
    else:
        print "no ospath to cookfile"

    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)
    try:
        req = Request(url, None, txheaders)
        # create a request object; the second argument is POST data, so the headers go third

        handle = urlopen(req)
        # and open it to return a handle on the url

    except IOError, e:
        print 'Failed to open "%s".' % url
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        return False

    else:
        print
        if cj is None:
            print "We don't have a cookie library available - sorry."
            print "I can't show you any cookies."
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, '  :  ', cookie
            cj.save(COOKIEFILE)           # save the cookies again

        page = handle.read()
        return (page)

def fetch_series():

  useragent="Firefox...etc."
  url="http://www.example.com/01.html"
  try:
    fetch_page(url,useragent)
  except [something]:
    print "failed to get page"
    sys.exit()

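The second function is only an example to show what I mean. Can anyone tell me what I should put there? I made the page-fetching function return False when it hits a 404; is that correct? And if so, why doesn't except False: work? Thanks for any help.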

OK, following the advice here, I have tried:

except urlib2.URLError, e:

except URLError, e:

except URLError:

except urllib2.IOError, e:

except IOError, e:

except IOError:

except urllib2.HTTPError, e:

except urllib2.HTTPError:

except HTTPError:

But none of them worked.

3 Answers

Interactive debugging:

The easiest way to find out what kind of exception Python raises, and what it contains, is to try the key calls in an interactive session:

>>> f = urllib2.urlopen('http://httpbin.org/status/404')
Traceback (most recent call last):
...
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: NOT FOUND

After that, sys.last_value holds the last exception value, which can be explored in the interactive session:
(use TAB after a dot for auto-expansion, dir(), vars(), and so on...)

>>> ev = sys.last_value
>>> ev.__class__
<class 'urllib2.HTTPError'>
>>> dir(ev)
['_HTTPError__super_init', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', 'args', 'close', 'code', 'errno', 'filename', 'fileno', 'fp', 'getcode', 'geturl', 'hdrs', 'headers', 'info', 'message', 'msg', 'next', 'read', 'readline', 'readlines', 'reason', 'strerror', 'url']
>>> vars(ev)
{'fp': <addinfourl at 140193880 whose fp = <socket._fileobject object at 0x01062370>>, 'fileno': <bound method _fileobject.fileno of <socket._fileobject object at 0x01062370>>, 'code': 404, 'hdrs': <httplib.HTTPMessage instance at 0x085ADF80>, 'read': <bound method _fileobject.read of <socket._fileobject object at 0x01062370>>, 'readlines': <bound method _fileobject.readlines of <socket._fileobject object at 0x01062370>>, 'next': <bound method _fileobject.next of <socket._fileobject object at 0x01062370>>, 'headers': <httplib.HTTPMessage instance at 0x085ADF80>, '__iter__': <bound method _fileobject.__iter__ of <socket._fileobject object at 0x01062370>>, 'url': 'http://httpbin.org/status/404', 'msg': 'NOT FOUND', 'readline': <bound method _fileobject.readline of <socket._fileobject object at 0x01062370>>}
>>> sys.last_value.code
404

Now try handling it:

>>> try: f = urllib2.urlopen('http://httpbin.org/status/404')
... except urllib2.HTTPError, ev:
...     print ev, "'s error code is", ev.code
...     
HTTP Error 404: NOT FOUND 's error code is 404
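
The class reported by ev.__class__ matters because an except clause matches by class. As a small offline sketch (no network is involved; the exceptions are built by hand purely for illustration, and the helper name classify is made up here), the following shows which clause catches what. The import fallback is an assumption that the code may also be run on Python 3, where urllib2 was split into urllib.request and urllib.error:

```python
from __future__ import print_function

try:
    from urllib2 import HTTPError, URLError       # Python 2
except ImportError:
    from urllib.error import HTTPError, URLError  # Python 3 (assumed equivalent)


def classify(exc):
    """Raise exc and report which except clause caught it (hypothetical helper)."""
    try:
        raise exc
    except HTTPError as e:   # most specific: carries the HTTP status in e.code
        return 'HTTPError %d' % e.code
    except URLError:         # e.g. unknown host, refused connection
        return 'URLError'
    except IOError:          # base class of both
        return 'IOError'


# Hand-built exceptions standing in for what urlopen() would raise.
fake_404 = HTTPError('http://www.example.com/01.html', 404, 'Not Found', None, None)
fake_dns = URLError('no such host')

print(classify(fake_404))
print(classify(fake_dns))
print(issubclass(HTTPError, URLError), issubclass(URLError, IOError))
```

Since HTTPError is a subclass of URLError, which is in turn an IOError, the order of the clauses matters: the most specific one has to come first, otherwise a broader clause swallows the 404 before its code can be inspected.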

Build a simple opener that does not raise on HTTP errors:

>>> ho = urllib2.OpenerDirector()
>>> ho.add_handler(urllib2.HTTPHandler())
>>> f = ho.open('http://localhost:8080/cgi/somescript.py'); f
<addinfourl at 138851272 whose fp = <socket._fileobject object at 0x01062370>>
>>> f.code
500
>>> f.read()
'Execution error: <pre style="background-color:#faa">\nNameError: name \'e\' is not defined\n<pre>\n'

The default handlers of urllib2.build_opener:

default_classes = [ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor]
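
To see why the hand-built opener returns the 500 response instead of raising, it can help to compare the handlers each opener actually holds. This is only a rough sketch: it assumes the handlers list on OpenerDirector is accessible (an implementation detail rather than documented API), the helper name handler_names is made up, and the import fallback assumes the code may also be run on Python 3:

```python
from __future__ import print_function

try:
    import urllib2 as ul            # Python 2
except ImportError:
    import urllib.request as ul     # Python 3 (assumed equivalent)


def handler_names(opener):
    """Return the class names of the handlers registered on an opener."""
    # opener.handlers is an implementation detail, used here only for inspection
    return [h.__class__.__name__ for h in opener.handlers]


# Opener built by hand: it only knows how to speak HTTP.
bare = ul.OpenerDirector()
bare.add_handler(ul.HTTPHandler())

# Default opener: includes the handlers that turn error statuses into exceptions.
default = ul.build_opener()

print(handler_names(bare))
print('HTTPErrorProcessor' in handler_names(default))
print('HTTPDefaultErrorHandler' in handler_names(default))
```

HTTPDefaultErrorHandler is the class whose http_error_default method appears in the traceback above; without it and HTTPErrorProcessor, the response object is simply handed back, whatever its status code.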

Answered 2025-04-17 by Python大师

If you want to detect a 404 (page not found), you should catch urllib2.HTTPError.

try:
    # create a request object (the headers are the third argument, not the second)
    req = urllib2.Request(url, None, {'User-agent': useragent})

    # and open it to return a handle on the url
    handle = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print 'We failed with error code - %s.' % e.code

    if e.code == 404:
        pass  # do stuff..
    else:
        pass  # other stuff...

    return False
else:
    pass  # ...
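
As for why except False: in the question does nothing: an except clause only matches an exception that is actually raised, and it names an exception class, whereas False here is an ordinary return value that has to be tested with if. A minimal sketch of that pattern follows. The opener parameter is an illustrative assumption, added so that a stand-in can replace urlopen when testing, and the import fallback assumes the code may also be run on Python 3:

```python
from __future__ import print_function

try:
    from urllib2 import HTTPError, urlopen        # Python 2
except ImportError:
    from urllib.error import HTTPError            # Python 3 (assumed equivalent)
    from urllib.request import urlopen


def fetch_page(url, opener=urlopen):
    """Return the page body, or False when the server answers with an HTTP error."""
    try:
        handle = opener(url)
    except HTTPError as e:
        print('We failed with error code - %s.' % e.code)
        return False          # the exception stops here; the caller sees only False
    return handle.read()


def fetch_series(url, opener=urlopen):
    """Test the return value with `is False`; there is nothing left to catch."""
    page = fetch_page(url, opener)
    if page is False:
        print('failed to get page')
        return None
    return page
```

Comparing with is False rather than a plain truth test keeps an empty page body from being mistaken for a failure.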

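Catch the error in fetch_series(). For this to work, fetch_page() must let the exception propagate, for example by re-raising it; if fetch_page() catches IOError itself and only returns False, as in the question, no exception ever reaches fetch_series():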

def fetch_page(url, useragent):
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    try:
        handle = urlopen(Request(url, None, {'User-agent': useragent}))
        #...
    except IOError, e:
        # ...
        raise  # re-raise so that fetch_series() can handle the error
    else:
        return handle.read()  #...

def fetch_series():
    useragent = "Firefox...etc."
    url = "http://www.example.com/01.html"
    try:
        fetch_page(url, useragent)
    except urllib2.HTTPError, e:
        print "failed to get page"
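
For the numbered pages in the question, one possible shape is sketched below. It rests on several assumptions: the URL pattern %02d.html and the limit and opener parameters are invented here for illustration, and the import fallback assumes the code may also be run on Python 3. The point is that fetch_page() does not catch HTTPError, so fetch_series() can:

```python
from __future__ import print_function

try:
    from urllib2 import HTTPError, Request, urlopen   # Python 2
except ImportError:
    from urllib.error import HTTPError                # Python 3 (assumed equivalent)
    from urllib.request import Request, urlopen


def fetch_page(url, useragent, opener=urlopen):
    """Return the page body; HTTPError is deliberately not caught here."""
    req = Request(url, None, {'User-agent': useragent})
    handle = opener(req)
    try:
        return handle.read()
    finally:
        handle.close()


def fetch_series(base, useragent, opener=urlopen, limit=99):
    """Fetch base/01.html, base/02.html, ... and stop at the first 404."""
    pages = []
    for n in range(1, limit + 1):
        url = '%s/%02d.html' % (base, n)   # hypothetical numbering scheme
        try:
            pages.append(fetch_page(url, useragent, opener))
        except HTTPError as e:
            if e.code == 404:
                break   # ran past the last numbered page
            raise       # any other HTTP error is unexpected, so pass it on
    return pages
```

Calling fetch_series('http://www.example.com', 'Firefox...etc.') would then collect pages until the server first answers 404; connection problems (URLError) are left uncaught on purpose so that they are not confused with the end of the series.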

http://docs.python.org/library/urllib2.html:

exception urllib2.HTTPError
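Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.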

code
An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.
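
Both points from the documentation can be tried offline. In the sketch below the HTTPError is built by hand with a fake body, standing in for a real 404 response; the helper name describe is made up, and the import fallback assumes the code may also be run on Python 3, where the responses dictionary lives in http.server:

```python
from __future__ import print_function
import io

try:
    from urllib2 import HTTPError                          # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler
except ImportError:
    from urllib.error import HTTPError                     # Python 3 (assumed equivalent)
    from http.server import BaseHTTPRequestHandler


def describe(err):
    """Return (code, standard reason phrase, error page body) for an HTTPError."""
    phrase = BaseHTTPRequestHandler.responses[err.code][0]
    body = err.read()      # HTTPError can be read like the object urlopen() returns
    err.close()
    return err.code, phrase, body


# Hand-built error with a fake body, standing in for a real 404 response.
fake = HTTPError('http://www.example.com/01.html', 404, 'NOT FOUND',
                 None, io.BytesIO(b'<h1>no such page</h1>'))
print(describe(fake))
```

Reading the body this way is handy when the server's error page carries useful detail, as in the 500 response shown in the first answer.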

Answered 2025-04-17 by Python大师

I suggest you take a look at the wonderful requests module.

With it you can achieve what you described, as follows:

import requests
from requests.exceptions import HTTPError

try:
    r = requests.get('http://httpbin.org/status/200')
    r.raise_for_status()
except HTTPError:
    print 'Could not download page'
else:
    print r.url, 'downloaded successfully'

try:
    r = requests.get('http://httpbin.org/status/404')
    r.raise_for_status()
except HTTPError:
    print 'Could not download', r.url
else:
    print r.url, 'downloaded successfully'

Answered 2025-04-17 by Python大师
