Python: How do I get the HTTP headers from a urllib2.urlopen call?

Asked 2025-04-15 11:29

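When we call urlopen, does urllib2 download the entire page?

I only want to read the HTTP response headers, not fetch the whole page. It looks as if urllib2 opens the HTTP connection and then retrieves the actual HTML page... or does it just start buffering the page when urlopen is called?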

import urllib2
myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers

html = page.readlines()  # stream page

6 Answers

Actually, urllib2 can send an HTTP HEAD request.

The question linked above shows how to make urllib2 send a HEAD request.

Here is a brief summary:

import urllib2

# Derive from Request class and override get_method to allow a HEAD request.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

myurl = 'http://bit.ly/doFeT'
request = HeadRequest(myurl)

try:
    response = urllib2.urlopen(request)
    response_headers = response.info()

    # This will just display all the dictionary key-value pairs.  Replace this
    # line with something useful.
    print (response_headers.dict)

except urllib2.HTTPError as e:
    # Prints the HTTP Status code of the response but only if there was a 
    # problem.
    print ("Error code: %s" % e.code)

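If you inspect the traffic with a network protocol analyzer such as Wireshark, you will see that it really sends a HEAD request rather than a GET.

Here are the HTTP request and response for the code above, as captured by Wireshark: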

HEAD /doFeT HTTP/1.1
Accept-Encoding: identity
Host: bit.ly
Connection: close
User-Agent: Python-urllib/2.7

HTTP/1.1 301 Moved
Server: nginx
Date: Sun, 19 Feb 2012 13:20:56 GMT
Content-Type: text/html; charset=utf-8
Cache-control: private; max-age=90
Location: http://www.kidsidebyside.org/?p=445
MIME-Version: 1.0
Content-Length: 127
Connection: close
Set-Cookie: _bit=4f40f738-00153-02ed0-421cf10a;domain=.bit.ly;expires=Fri Aug 17 13:20:56 2012;path=/; HttpOnly
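
For readers on Python 3, where urllib2 was merged into urllib.request, the subclass should not be needed, because Request accepts a method argument (Python 3.3 and later). A minimal sketch under that assumption; the helper name head_headers and the default timeout are illustrative and do not come from the answer:

```python
import urllib.request


def head_headers(url, timeout=10):
    """Send a HEAD request and return (status, headers as a dict).

    Assumes Python 3.3+; `head_headers` is a hypothetical helper name.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # A HEAD response carries no body, so only the headers are transferred.
        return response.status, dict(response.headers.items())
```

It could be called as status, headers = head_headers("http://bit.ly/doFeT"). As with urllib2, a 4xx or 5xx status is raised as an HTTPError (urllib.error.HTTPError in Python 3) rather than returned.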

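However, as a comment on the other question points out, if the requested URL redirects, urllib2 sends a GET request, not a HEAD, to the destination. That can be a serious problem if you really want to send only HEAD requests.

The request above involved a redirect. Here is the request to the destination as captured by Wireshark: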

GET /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Accept-Encoding: identity
Host: www.kidsidebyside.org
Connection: close
User-Agent: Python-urllib/2.7
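
One way to keep HEAD across the redirect while staying in the standard library is to override the redirect handler. The sketch below assumes Python 3 (urllib.request in place of urllib2); HeadRedirectHandler and head_following_redirects are hypothetical names introduced for illustration. Whether the stock handler downgrades HEAD to GET may vary between Python versions, so the override is written to be harmless either way.

```python
import urllib.request


class HeadRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Keep the HEAD method when following a redirect (hypothetical helper)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is not None and req.get_method() == "HEAD":
            # Force the follow-up request to use HEAD as well.
            new_req.method = "HEAD"
        return new_req


def head_following_redirects(url, timeout=10):
    """Return (final_url, status, headers) using HEAD for every hop."""
    # Passing the subclass replaces the default redirect handler in the opener.
    opener = urllib.request.build_opener(HeadRedirectHandler)
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=timeout) as response:
        return response.geturl(), response.status, dict(response.headers.items())
```

The returned final_url is the address reached after all redirects, so comparing it with the original URL shows whether a redirect happened.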

If you would rather not use urllib2, try Joe Gregorio's httplib2 library:

import httplib2

url = "http://bit.ly/doFeT"
http_interface = httplib2.Http()

try:
    response, content = http_interface.request(url, method="HEAD")
    print ("Response status: %d - %s" % (response.status, response.reason))

    # httplib2.Response is a dict subclass holding the response headers, so
    # this displays all the header key-value pairs.  Replace this line with
    # something useful.
    print (dict(response))

except httplib2.ServerNotFoundError as e:
    print (e)

The advantage of this library is that it uses HEAD for both the initial HTTP request and the redirected request.

This is the first request:

HEAD /doFeT HTTP/1.1
Host: bit.ly
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

This is the second request, to the destination:

HEAD /2009/05/come-and-draw-the-circle-of-unity-with-us/ HTTP/1.1
Host: www.kidsidebyside.org
accept-encoding: gzip, deflate
user-agent: Python-httplib2/0.7.2 (gzip)

Answered 2025-04-15 by Python大师


What about sending a HEAD request instead of a normal GET? The following snippet (copied from a similar question) does exactly that.

>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> print res.getheaders()
[('content-length', '0'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Sat, 20 Sep 2008 06:43:36 GMT'), ('content-type', 'text/html; charset=ISO-8859-1')]
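
In Python 3 the httplib module is named http.client, and the same low-level approach carries over. A minimal sketch under that assumption; head_via_http_client is a hypothetical wrapper, not part of any library. Note that http.client does not follow redirects on its own, so a 301 or 302 is returned to the caller as-is.

```python
import http.client


def head_via_http_client(host, path="/", port=80, timeout=10):
    """Issue a HEAD request and return (status, reason, header list)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        # getheaders() returns a list of (name, value) tuples.
        return res.status, res.reason, res.getheaders()
    finally:
        conn.close()
```

For the session above, the equivalent call would be head_via_http_client("www.google.com", "/index.html").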

Answered 2025-04-15 by Python大师


Use the response.info() method to get the response headers.

From the urllib2 documentation:

urllib2.urlopen(url[, data][, timeout])

...

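This function returns a file-like object with two additional methods:

geturl(): returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed

info(): returns the meta-information of the page, such as headers, in the form of an httplib.HTTPMessage instance (see Quick Reference to HTTP Headers)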

So, for your example, try looking through the result of response.info().headers for what you need.

Note that there is an important caveat to using httplib.HTTPMessage; see Python issue 4773 for details.
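
To illustrate with Python 3, where the same object is also exposed as response.headers (an http.client.HTTPMessage): the headers are available as soon as urlopen returns, before any explicit read() of the body. A sketch under those assumptions; peek_headers is a hypothetical name. Unlike the HEAD approaches, this still issues a GET, so the server starts sending the body even though the code never reads it.

```python
import urllib.request


def peek_headers(url, timeout=10):
    """Open `url` with GET and return header info without calling read()."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        headers = response.headers  # same object that response.info() returns
        return {
            "final_url": response.geturl(),
            # Header lookup is case-insensitive; a missing header gives None.
            "content_type": headers.get("content-type"),
            "all": headers.items(),
        }
```

The "all" entry is a list of (name, value) pairs, which is usually easier to work with than the raw header lines.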

Answered 2025-04-15 by Python大师

