What is wrong with this gzip format?

3 votes

3 answers

2941 views

Asked 2025-04-16 03:46

I use the following Python code to download gzip-compressed web pages from a server:

url = "http://www.v-gn.de/wbb/"
import urllib2
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()

import gzip
from StringIO import StringIO
html = gzip.GzipFile(fileobj=StringIO(content)).read()

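This usually works, but for this particular URL it raises a struct.error exception. I get a similar result if I use wget with the "Accept-encoding" header. Browsers, however, seem to decompress the response without any problem.

So my question is: is there a way to make my Python code decompress the HTTP response without dropping the "Accept-encoding" header?

For completeness, here is the wget command line I used: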

wget --user-agent="Mozilla" --header="Accept-Encoding: gzip,deflate" http://www.v-gn.de/wbb/

Tags: wget, data compression, network programming, HTTP response, request headers, gzip, decompression

3 Answers

You can create your own handler by subclassing urllib2.HTTPHandler and overriding its http_open() method.

import gzip
from StringIO import StringIO
import httplib, urllib, urllib2

class GzipHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        # Ask the server for a gzip-compressed response.
        req.add_header('Accept-encoding', 'gzip')
        r = self.do_open(httplib.HTTPConnection, req)
        # Decompress only if the server says it actually sent gzip.
        if (
            'Content-Encoding' in r.headers and
            r.headers['Content-Encoding'] == 'gzip'
        ):
            fp = gzip.GzipFile(fileobj=StringIO(r.read()))
        else:
            fp = r
        # Wrap the (possibly decompressed) body in a response-like object.
        response = urllib.addinfourl(fp, r.headers, r.url, r.code)
        response.msg = r.msg
        return response

Then build your opener.

def retrieve(url):
    request = urllib2.Request(url)
    opener = urllib2.build_opener(GzipHandler)
    return opener.open(request)

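The difference with this approach is that it checks whether the server actually returned a gzip-encoded response, and it does so inside the opener while the request is being handled, rather than afterwards in the calling code.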
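The header check itself can be sketched on its own, without any network access. This is only an illustration in Python 3 (the thread's code is Python 2); the helper name decode_body and the plain dict standing in for the response headers are assumptions, not part of urllib2 or of the handler shown here.

```python
import gzip

def decode_body(headers, body):
    """Decompress body only if the headers declare gzip encoding.

    headers: any mapping with a .get() method (hypothetical stand-in
             for the response headers object).
    body:    the raw response bytes.
    """
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    # No gzip declared: hand the bytes back untouched.
    return body

# Locally built sample data instead of a real server response.
page = b"<html><body>hello</body></html>"
print(decode_body({"Content-Encoding": "gzip"}, gzip.compress(page)))
print(decode_body({}, page))
```

Checking the header before decompressing means an uncompressed response is passed through as-is instead of failing inside the gzip module.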


Answered 2025-04-16 by Python大师


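I ran the command you gave. It downloaded the compressed data into index.html. I renamed index.html to index.html.gz and tried to decompress it with gzip -d index.html.gz, but got an error: gzip: index.html.gz: unexpected end of file. In other words, the end of the file is truncated.

Next I tried zcat index.html.gz, which worked, except that the same error appeared after the closing </html> tag.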

$ zcat index.html.gz
...
  </td>
 </tr>
</table>


</body>
</html>
gzip: index.html.gz: unexpected end of file

The server is probably at fault.
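If the stream really is cut off at the end, one possible workaround is to decompress incrementally with zlib, which returns whatever it could decode instead of failing on the missing end of the stream. The following is a minimal Python 3 sketch under that assumption; the helper name gunzip_lenient is made up for illustration, and the truncated input is simulated locally rather than fetched from the server.

```python
import gzip
import zlib

def gunzip_lenient(data):
    """Decompress gzip data, tolerating a stream that ends too early.

    Returns (recovered_bytes, complete), where complete is False
    when the end-of-stream marker was never reached.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    recovered = decomp.decompress(data)
    return recovered, decomp.eof

# Simulate the broken response by dropping the 8-byte gzip trailer.
page = b"<html><body>" + b"hello " * 200 + b"</body></html>"
full = gzip.compress(page)
text, complete = gunzip_lenient(full[:-8])
print(len(text), complete)
```

Because the trailer holds the CRC and length, a stream recovered this way has not been integrity-checked; data that is corrupt rather than merely truncated would still raise zlib.error.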

Answered 2025-04-16 by Python大师


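It looks like you can use readline() on the gzip.GzipFile object, whereas read() raises struct.error because the file ends unexpectedly.

Since readline works for most of the data (everything except the very end of the file), you can try this: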

import urllib2
import StringIO
import gzip
import struct

url = "http://www.v-gn.de/wbb/"
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()

# Wrap the compressed bytes in a file-like object for GzipFile.
fh = StringIO.StringIO(content)
html = gzip.GzipFile(fileobj=fh)
try:
    # Iterating reads line by line, so everything before the broken end is printed.
    for line in html:
        line = line.rstrip()
        print(line)
except struct.error:
    # The stream ends without a complete gzip trailer; ignore it.
    pass
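On Python 3 the same idea would look slightly different: as far as I know, the gzip module there reports a stream that ends too early with EOFError rather than struct.error. A minimal sketch, assuming that behaviour, with a hypothetical helper name read_lines_lenient and locally built data in place of the real server response:

```python
import gzip
import io

def read_lines_lenient(data):
    """Collect decompressed lines until the gzip stream breaks off."""
    lines = []
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
            for line in gz:
                lines.append(line)
    except EOFError:
        # Truncated stream: keep the lines that were read so far.
        pass
    return lines

# Simulate the broken response by cutting off the 8-byte gzip trailer.
page = b"<html>\n<body>hello</body>\n</html>\n"
broken = gzip.compress(page)[:-8]
for line in read_lines_lenient(broken):
    print(line.rstrip())
```

A final fragment that does not end in a newline may be lost if the error is raised while that line is still being assembled, which matches the caveat about the last part of the file.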

Answered 2025-04-16 by Python大师
