Downloading multiple files in a loop with Python

Published 2024-05-15 13:20:34


There is a problem with my code.

#!/usr/bin/env python3.1

import urllib.request

# Disguise as a Mozilla browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

URL = "http://www.example.com/img"
req = urllib.request.Request(URL, headers={'User-Agent': userAgent})

# Counter for the filename.
i = 0

while True:
    fname = str(i).zfill(3) + '.png'
    req.full_url = URL + fname

    f = open(fname, 'wb')

    try:
        response = urllib.request.urlopen(req)
    except:
        break
    else:
        f.write(response.read())
        i += 1
        response.close()
    finally:
        f.close()

The problem seems to arise when I create the urllib.request.Request object (called req). I create it with a nonexistent URL, but later change the URL to what it should be. I do this so that I can reuse the same urllib.request.Request object instead of creating a new one on every iteration. There may be a mechanism in Python for doing this, but I'm not sure what it is.
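For what it's worth, on newer Pythons (3.4 and later) Request.full_url is a property whose setter re-parses the URL, so reusing one Request object this way does keep the host and path consistent. A minimal sketch, using the placeholder URL from the question:

```python
from urllib.request import Request

# In Python 3.4+ full_url is a property; assigning to it re-parses
# the URL, so host and selector are updated to match.
req = Request('http://www.example.com/placeholder',
              headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
req.full_url = 'http://www.example.com/img000.png'

print(req.host)      # unchanged host
print(req.selector)  # path portion after re-parsing
```

On Python 3.1, where this question was written, full_url was a plain attribute, so creating a fresh Request per iteration (as in Edit 2 below) is the safer approach.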

Edit: The error message is:

>>> response = urllib.request.urlopen(req);
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.1/urllib/request.py", line 121, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.1/urllib/request.py", line 356, in open
    response = meth(req, response)
  File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.1/urllib/request.py", line 394, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.1/urllib/request.py", line 328, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.1/urllib/request.py", line 476, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
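The 403 above is raised as urllib.error.HTTPError, which carries the HTTP status code. Catching it specifically, instead of a bare except, lets a download loop tell "no more files" (404) apart from other failures such as this 403. A sketch, under the assumption that a 404 marks the end of the sequence:

```python
import urllib.error

def should_stop(err):
    """Stop the download loop only when the server reports the file
    as missing; other codes (like 403) signal a different problem."""
    return err.code == 404

# HTTPError instances expose the status code directly; constructing
# one by hand here just to demonstrate the attribute:
err = urllib.error.HTTPError('http://www.example.com/img000.png',
                             404, 'Not Found', None, None)
print(should_stop(err))
```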

Edit 2: My solution is below. I probably should have done it this way from the start, since I knew it would work:

import urllib.request

# Disguise as a Mozilla browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Counter for the filename.
i = 0

while True:
    fname = str(i).zfill(3) + '.png'
    URL = "http://www.example.com/img" + fname

    f = open(fname, 'wb')

    try:
        req = urllib.request.Request(URL, headers={'User-Agent': userAgent})
        response = urllib.request.urlopen(req)
    except:
        break
    else:
        f.write(response.read())
        i += 1
        response.close()
    finally:
        f.close()

3 Answers

urllib is fine for small scripts that only need to make one or two network interactions, but if you are doing more work than that, you will probably find that a third-party HTTP library (the example below uses requests) better suits your needs. Your particular example might look like:

from itertools import count
import requests

HEADERS = {'user-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
URL = "http://www.example.com/img%03d.png"

# With a session we get keep-alive connections
session = requests.Session()

for n in count():
    full_url = URL % n
    ignored, filename = full_url.rsplit('/', 1)

    with open(filename, 'wb') as outfile:
        response = session.get(full_url, headers=HEADERS)
        if not response.ok:
            break
        outfile.write(response.content)

Edit: If regular HTTP authentication is available (which the 403 Forbidden response strongly suggests), you can add it to the get call with the auth parameter, like this:

response = session.get(full_url, headers=HEADERS, auth=('username', 'password'))

Don't break when you get an exception. Change

except:
    break

to:

except:
    # Probably should log some debug information here.
    pass

This skips any problematic request, so that a single failure does not abort the whole run.
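Note that with a bare pass, the while True loop never terminates. One middle ground between "break on the first error" and "pass forever" is to stop only after several consecutive failures; a sketch, where fetch is a hypothetical stand-in for the real urlopen/requests call:

```python
MAX_MISSES = 3  # assumed tolerance; tune to taste

def download_all(fetch):
    saved = []
    misses = 0
    i = 0
    while misses < MAX_MISSES:
        name = str(i).zfill(3) + '.png'
        try:
            data = fetch(name)
        except Exception:
            misses += 1         # tolerate an isolated failure...
        else:
            # real code would write `data` to disk here
            saved.append(name)
            misses = 0          # ...but reset the counter on success
        i += 1
    return saved

# Stub that succeeds for the first five names, then always fails:
def stub(name):
    if int(name[:3]) >= 5:
        raise IOError(name)
    return b'data'

print(download_all(stub))
```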

If you want to use a custom user agent for every request, you can subclass FancyURLopener.

Here is an example: http://wolfprojects.altervista.org/changeua.php
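Be aware that FancyURLopener has been deprecated for a long time in Python 3. A rough modern equivalent, setting a default User-Agent for every request via an installed opener (using the spoofed agent string from the question), might look like:

```python
import urllib.request

# Build an opener whose default headers apply to every subsequent
# urllib.request.urlopen() call in this process.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')]
urllib.request.install_opener(opener)

# From here on, urllib.request.urlopen(url) sends the custom agent.
```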
