How can I modify this Python download function?

2 votes
3 answers
501 views
Asked 2025-04-16 09:52

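Right now it's a bit flaky. With gzip, with images, sometimes it just doesn't work.

How can I change this download function so that it handles anything (gzip or whatever other headers come back)?

How can I automatically "detect" whether the response is gzip? I don't want to keep passing True/False like I do now.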

def download(source_url, g = False, correct_url = True):
    try:
        socket.setdefaulttimeout(10)
        agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)','Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)','Microsoft Internet Explorer/4.0b1 (Windows 95)','Opera/8.00 (Windows NT 5.1; U; en)']
        ree = urllib2.Request(source_url)
        ree.add_header('User-Agent',random.choice(agents))
        ree.add_header('Accept-encoding', 'gzip')
        opener = urllib2.build_opener()
        h = opener.open(ree).read()
        if g:
            compressedstream = StringIO(h)
            gzipper = gzip.GzipFile(fileobj=compressedstream)
            data = gzipper.read()
            return data
        else:
            return h
    except Exception, e:
        return ""

3 Answers

1
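import urllib2
import StringIO
import gzip

req = urllib2.Request('http://foo/')
resp = urllib2.urlopen(req)
data = resp.read()
# Content-Encoding may be missing, so use .get() with a default
if 'gzip' in resp.headers.get('Content-Encoding', ''):
    # StringIO is imported as a module, so the class is StringIO.StringIO
    compressedstream = StringIO.StringIO(data)
    gzipper = gzip.GzipFile(fileobj=compressedstream)
    data = gzipper.read()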

# etc...


1

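To detect what kind of data you are downloading, replace h = opener.open(ree).read() with h = opener.open(ree)

Now h holds a response object rather than the body. You can inspect the headers through h.headers (a dict-like object). The two headers you care about are 'content-type' and 'content-encoding'; checking them tells you what is being sent.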

def download(source_url, correct_url = True):
    try:
        socket.setdefaulttimeout(10)
        agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)','Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)','Microsoft Internet Explorer/4.0b1 (Windows 95)','Opera/8.00 (Windows NT 5.1; U; en)']
        ree = urllib2.Request(source_url)
        ree.add_header('User-Agent',random.choice(agents))
        ree.add_header('Accept-encoding', 'gzip')
        opener = urllib2.build_opener()
        h = opener.open(ree)
        if ('gzip' in h.headers.get('content-type', '') or
                'gzip' in h.headers.get('content-encoding', '')):
            compressedstream = StringIO(h.read())
            gzipper = gzip.GzipFile(fileobj=compressedstream)
            data = gzipper.read()
            return data
        else:
            return h.read()
    except Exception, e:
        return ""
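
To make the header check above concrete, here is a minimal sketch that works on a plain dict of headers, so it can be tried without a network connection. The function names `content_encodings` and `is_gzip_response` are made up for illustration. The sketch deliberately looks only at Content-Encoding, on the assumption that a Content-Type such as application/gzip means the file itself is a gzip archive that you may want to keep compressed.

```python
def content_encodings(headers):
    """Return the lower-cased Content-Encoding tokens from a header mapping."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = dict((k.lower(), v) for k, v in headers.items())
    raw = lowered.get('content-encoding', '')
    # The header may list several codings separated by commas.
    return [token.strip().lower() for token in raw.split(',') if token.strip()]


def is_gzip_response(headers):
    """True if the header mapping says the body was gzip-compressed."""
    encodings = content_encodings(headers)
    return 'gzip' in encodings or 'x-gzip' in encodings
```

With a real response you would pass something like dict(h.headers.items()) and only wrap the body in gzip.GzipFile when is_gzip_response returns True.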

4

Check the Content-Encoding header:

import urllib2
import socket
import random
import StringIO
import gzip

def download(source_url):
    try:
        socket.setdefaulttimeout(10)
        agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)','Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)','Microsoft Internet Explorer/4.0b1 (Windows 95)','Opera/8.00 (Windows NT 5.1; U; en)']
        ree = urllib2.Request(source_url)
        ree.add_header('User-Agent',random.choice(agents))
        ree.add_header('Accept-encoding', 'gzip')
        opener = urllib2.build_opener()
        response = opener.open(ree)
        encoding=response.headers.getheader('Content-Encoding')
        content = response.read()
        if encoding and 'gzip' in encoding:
            compressedstream = StringIO.StringIO(content)
            gzipper = gzip.GzipFile(fileobj=compressedstream)
            data = gzipper.read()
            return data
        else:
            return content
    except urllib2.URLError as e:
        return ""

data=download('http://api.stackoverflow.com/1.0/questions/3708418?type=jsontext')
print(data)
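
The question also asks about encodings other than gzip. As a rough sketch, assuming you only care about gzip and deflate, the body can be decoded with zlib alone once you have the Content-Encoding value. `decode_body` is a hypothetical helper, not part of urllib2, and returning unknown encodings unchanged is a choice made for this example.

```python
import zlib


def decode_body(body, content_encoding):
    """Undo the compression named by a Content-Encoding header value."""
    encoding = (content_encoding or '').strip().lower()
    if encoding in ('gzip', 'x-gzip'):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == 'deflate':
        try:
            # "deflate" is supposed to be zlib-wrapped data...
            return zlib.decompress(body)
        except zlib.error:
            # ...but some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # Missing or unrecognised encoding: return the bytes untouched.
    return body
```

In the function above this would be called as decode_body(content, encoding). Note that a server should only reply with deflate if the request's Accept-encoding header lists it, and the download function here asks for gzip only.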

If you run into servers that don't report the content encoding as gzip, you can be more aggressive and simply try decompressing first:

def download(source_url):
    try:
        socket.setdefaulttimeout(10)
        agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)','Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)','Microsoft Internet Explorer/4.0b1 (Windows 95)','Opera/8.00 (Windows NT 5.1; U; en)']
        ree = urllib2.Request(source_url)
        ree.add_header('User-Agent',random.choice(agents))
        ree.add_header('Accept-encoding', 'gzip')
        opener = urllib2.build_opener()
        response = opener.open(ree)
        content = response.read()
        compressedstream = StringIO.StringIO(content)
        gzipper = gzip.GzipFile(fileobj=compressedstream)
        try:
            data = gzipper.read()
        except IOError:
            data = content
        return data        
    except urllib2.URLError as e:
        return ""
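
An alternative to catching IOError is to look at the first two bytes of the body: every gzip stream starts with the magic number 0x1f 0x8b (RFC 1952). This is a different technique from the try/except above, sketched here on the assumption that the whole body is already in memory as a byte string. `gunzip_if_needed` is a made-up name.

```python
import gzip
import io
import zlib

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream (RFC 1952)


def gunzip_if_needed(content):
    """Decompress content only when it starts with the gzip magic number."""
    if content[:2] != GZIP_MAGIC:
        return content
    try:
        return gzip.GzipFile(fileobj=io.BytesIO(content)).read()
    except (IOError, EOFError, zlib.error):
        # It looked like gzip but was not valid; hand back the raw bytes.
        return content
```

Like the try-first version, this will also unpack a .gz file that you meant to download as-is, so it is only appropriate when you always want the uncompressed data.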
