wget vs. Python's urlretrieve

9 votes, 8 answers, 38493 views

数据工程师

Asked 2025-04-15 12:09

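I need to download several GB of data from a website. The data is stored as .gz files, each about 45MB.

The simple way to get these files is "wget -r -np -A files url". This downloads the data recursively and mirrors the site's contents. The download is very fast, about 4MB per second.

However, I also wanted to try writing my own URL parser in Python.

Downloading the data with Python's urlretrieve is very slow, perhaps 4 times slower than wget, at only about 500KB per second. I use HTMLParser to parse the href attributes.

I don't understand why this happens. Is there a setting I can adjust?

Thanks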
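
For reference, the href-parsing step described in the question might look roughly like the sketch below. It is written for Python 3 and is not the asker's actual code: the names `LinkCollector` and `extract_links`, the `.gz` filter, and the example URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, suffix=".gz"):
    """Return absolute URLs of the links in `html` whose href ends with `suffix`."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs if h.endswith(suffix)]


# Made-up directory listing, used only to exercise the parser
page = '<a href="../">Parent</a> <a href="part1.gz">part1</a> <a href="/x/part2.gz">part2</a>'
for link in extract_links(page, "http://example.com/data/"):
    print(link)
```

Each URL returned this way could then be passed to urlretrieve. Parsing a listing page like this is cheap compared with transferring 45MB files, so it is unlikely to explain the speed difference on its own.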

Tags: wget, file parsing, data download, htmlparser, urlretrieve, recursive download, download speed, website mirroring

8 Answers

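Transfer speeds can be misleading. Try the script below, which downloads the same URL with both wget and urllib's urlretrieve. Run it several times, because if you are behind a proxy, the second download may be faster thanks to caching.

For small files, wget may be slightly slower because it has to start an external process, but for large files that overhead becomes insignificant.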

from time import time
import subprocess
import urllib.request

target = "http://example.com"  # change this to a more useful URL

# Time wget, which runs as an external process
wget_start = time()

proc = subprocess.Popen(["wget", target])
proc.communicate()  # wait for wget to finish

wget_end = time()

# Time urlretrieve, which saves to a temporary file when no filename is given
url_start = time()
urllib.request.urlretrieve(target)
url_end = time()

print("wget -> %s" % (wget_end - wget_start))
print("urllib.request.urlretrieve -> %s" % (url_end - url_start))
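
Because a single run can be skewed by caching, it may help to repeat each measurement and compare all the timings. Here is a minimal Python 3 sketch of that idea. The helper `time_runs` is a hypothetical name, and the workload passed to it is only a stand-in so that the snippet runs without network access.

```python
from time import perf_counter


def time_runs(func, repeats=3):
    """Call func() `repeats` times and return the elapsed seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        durations.append(perf_counter() - start)
    return durations


# Stand-in workload. For a real comparison, pass something like
#   lambda: urllib.request.urlretrieve(target)
# or
#   lambda: subprocess.call(["wget", target])
durations = time_runs(lambda: sum(range(100000)), repeats=3)
print(["%.4f s" % d for d in durations])
```

If the first run is much slower than the rest, a cache somewhere along the path is probably involved.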

Answered 2025-04-15 by Python大师


For me, urllib is just as fast as wget. Try this code, which shows the progress percentage the way wget does.

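import sys
import urllib.request

def reporthook(blocks, block_size, total_size):
    # urlretrieve calls this after each block; total_size can be -1 when unknown
    if total_size > 0:
        percent = min(100, blocks * block_size / total_size * 100)
        # '\r' moves back to the start of the line so the progress overwrites itself
        sys.stdout.write("\r%5.1f%% of %d bytes" % (percent, total_size))
    else:
        sys.stdout.write("\r%d bytes read" % (blocks * block_size))
    sys.stdout.flush()

for url in sys.argv[1:]:
    # Use everything after the last '/' as the local file name
    filename = url[url.rfind('/') + 1:]
    print(url, "->", filename)
    urllib.request.urlretrieve(url, filename, reporthook)
    print()  # end the progress line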

Answered 2025-04-15 by Python大师


You may have made a unit-conversion mistake.

Note that 500KB/s (kilobytes per second) is equal to 4Mb/s (megabits per second).
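
The arithmetic can be checked with a short Python 3 snippet. It assumes decimal units (1 KB = 1000 bytes, 1 Mb = 1,000,000 bits), and the function name is made up for illustration.

```python
def kb_per_s_to_mbit_per_s(kb_per_s):
    """Convert kilobytes per second to megabits per second (decimal units)."""
    bits_per_s = kb_per_s * 1000 * 8  # 1000 bytes per KB, 8 bits per byte
    return bits_per_s / 1_000_000     # 1,000,000 bits per Mb


print(kb_per_s_to_mbit_per_s(500))   # 4.0
print(kb_per_s_to_mbit_per_s(4000))  # 32.0, i.e. a true 4 MB/s
```

So if one tool reports megabits and the other reports kilobytes, "4" and "500" can describe the same transfer rate.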

Answered 2025-04-15 by Python大师

