Python urllib2 raises an "unknown url type" error for certain URLs when detecting redirects

Posted 2024-05-14 16:41:24


When I load the following URL with urllib2, everything succeeds:

# -*- coding: utf-8 -*-

import traceback
import urllib2
import httplib

url = 'http://www.marchofdimes.com/pregnancy/preterm-labor-and-birth.aspx'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    #'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
   'Accept-Encoding': 'gzip, deflate',
   'Accept-Language': 'en-US,en;q=0.8',
   'Connection': 'keep-alive'}

request = urllib2.Request(url, headers=HEADERS)
try:

    response = urllib2.urlopen(request)
    response_header = response.info()
    print "Success: %s - %s"%(response.code, response_header)

except urllib2.HTTPError, e:
    print 'urllib2.HTTPError %s - %s'%(e.code, e.headers)
except urllib2.URLError, e:
    print "Unknown URLError: %s"%(e.reason)
except httplib.BadStatusLine as e:
    print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
except Exception:
    print "Unkown Exception: %s"%(traceback.format_exc())

This prints the "Success" line with the response code and headers.

Now I try to load the URL with an opener that stops at each hop instead of following it, so that I can detect redirects:

... Same as above ...

class NoRedirection(urllib2.HTTPErrorProcessor):
    def http_response(self, request, response):
        return response
    https_response = http_response   

def load_url(url):
    print "====== Loading %s ======"%url
    request = urllib2.Request(url, headers=HEADERS)
    try:          
        opener = urllib2.build_opener(NoRedirection)    
        request = urllib2.Request(url, headers=HEADERS)
        response = opener.open(request)
        response_header = response.info()
        ending_url = response_header.getheader('Location') or url
        print "Success: %s - %s"%(response.code, response_header)
        has_redirect = url != ending_url
        if has_redirect:
            load_url(ending_url)
    except urllib2.HTTPError, e:
        print 'urllib2.HTTPError %s - %s'%(e.code, e.headers)
    except urllib2.URLError, e:
        print "Unknown URLError: %s"%(e.reason)
    except httplib.BadStatusLine as e:
        print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
    except Exception:            
        print "Unkown Exception: %s"%(traceback.format_exc())

load_url(url)

When run, it outputs:

====== Loading http://www.marchofdimes.com/pregnancy/preterm-labor-and-birth.aspx ======
Success: 301 - Content-Type: text/html; charset=UTF-8
Location: http://www.marchofdimes.org/pregnancy/preterm-labor-and-birth.aspx
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Fri, 23 Jan 2015 23:36:58 GMT
Connection: close
Content-Length: 189

====== Loading http://www.marchofdimes.org/pregnancy/preterm-labor-and-birth.aspx ======
Success: 302 - Location: /404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
X-UA-Compatible: IE=edge
Date: Fri, 23 Jan 2015 23:36:59 GMT
Connection: close
Content-Length: 180

====== Loading /404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx ======
Unkown Exception: Traceback (most recent call last):
  File "urltest.py", line 32, in load_url
    response = opener.open(request)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 258, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx

This redirect detection has worked with every other URL I have tried, so I'm confused about why it fails on this one.
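
One thing the traceback points to: the second response's Location header is a relative path (`/404.aspx?...`) with no scheme or host, and that string is then passed to `urllib2.Request` unchanged. A minimal sketch, assuming this is the cause, resolves each Location value against the URL that produced it before loading it. `resolve_location` is a hypothetical helper name, not part of urllib2, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urlparse import urljoin  # Python 2, matching the urllib2 code above
except ImportError:
    from urllib.parse import urljoin  # Python 3 location of the same function


def resolve_location(current_url, location):
    """Return an absolute URL for a Location header value.

    Hypothetical helper: a missing header falls back to the current URL,
    and a relative value is resolved against the URL that returned it.
    """
    if not location:
        return current_url
    return urljoin(current_url, location)


base = 'http://www.marchofdimes.org/pregnancy/preterm-labor-and-birth.aspx'
relative = '/404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx'

# The relative path gains the scheme and host of the page that sent it.
print(resolve_location(base, relative))
```

In `load_url`, this would replace the `getheader('Location') or url` expression with `resolve_location(url, response_header.getheader('Location'))`, so the recursive call always receives an absolute URL.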

