wget vs. requests.get for file downloads

Published 2024-04-19 22:09:08


I have a strange error. There is a file on Dropbox that I am downloading with the following Python code:

import requests
import shutil

url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
r = requests.get(url, stream=True)
path_to_save = "/tmp/data.dload-1"
with open(path_to_save, 'wb') as f:
    shutil.copyfileobj(r.raw, f)  

This downloads the file to /tmp/data.dload-1.

I downloaded the same file with wget, using wget https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0 -O /tmp/data.dload-2

Both files are reported by file as the same type (gzip compressed data).

But untarring them gives different results:

(dl)x:x$ tar -zxvf /tmp/data.dload-1
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(dl) x:x$ tar -zxvf /tmp/data.dload-2
testfiles/a
testfiles/b
(dl)x:x$ 

Does anyone know why this happens and, more importantly, how I can download the tar file with Python (preferably with requests)?

Here is the result of r.headers:

(dl) x:x$ python dload-test.py
{'Server': 'nginx', 'Date': 'Fri, 27 Apr 2018 17:27:06 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'no-cache', 'Content-Security-Policy': "script-src 'unsafe-eval' https://www.dropbox.com/static/compiled/js/ https://www.dropbox.com/static/javascript/ https://www.dropbox.com/static/api/ https://cfl.dropboxstatic.com/static/compiled/js/ https://www.dropboxstatic.com/static/compiled/js/ https://cfl.dropboxstatic.com/static/js/ https://www.dropboxstatic.com/static/js/ https://cfl.dropboxstatic.com/static/previews/ https://www.dropboxstatic.com/static/previews/ https://cfl.dropboxstatic.com/static/api/ https://www.dropboxstatic.com/static/api/ https://cfl.dropboxstatic.com/static/cms/ https://www.dropboxstatic.com/static/cms/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ 'unsafe-inline' ; img-src https://* data: blob: ; frame-ancestors 'self' ; default-src 'none' ; frame-src https://* carousel://* dbapi-6://* dbapi-7://* dbapi-8://* itms-apps://* itms-appss://* ; worker-src https://www.dropbox.com/static/serviceworker/ blob: ; style-src https://* 'unsafe-inline' 'unsafe-eval' ; connect-src https://* ws://127.0.0.1:*/ws ; object-src 'self' https://cfl.dropboxstatic.com/static/ https://www.dropboxstatic.com/static/ https://flash.dropboxstatic.com https://swf.dropboxstatic.com https://dbxlocal.dropboxstatic.com ; media-src https://* blob: ; font-src https://* data: ; child-src https://www.dropbox.com/static/serviceworker/ blob: ; form-action 'self' https://www.dropbox.com/ https://dl-web.dropbox.com/ https://photos.dropbox.com/ https://accounts.google.com/ https://api.login.yahoo.com/ https://login.yahoo.com/ ; base-uri 'self' api-stream.dropbox.com showbox-tr.dropbox.com ; report-uri https://www.dropbox.com/csp_log", 'Dropbox-Streaming': 'V=1', 'Pragma': 'no-cache', 'Referrer-Policy': 
'origin-when-cross-origin', 'Set-Cookie': 'locale=en; Domain=dropbox.com; expires=Wed, 26 Apr 2023 17:27:06 GMT; Path=/; secure, gvc=OTU0NjExNzUwNjc0NjQxNzgwMzE0OTgzMzkzNjc3MzM5OTYzNzc%3D; expires=Wed, 26 Apr 2023 17:27:06 GMT; httponly; Path=/; secure, flash=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, puc=; expires=Fri, 27 Apr 2018 17:27:06 GMT; httponly; Path=/; secure, bang=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, seen-sl-signup-modal=VHJ1ZQ%3D%3D; expires=Sun, 27 May 2018 17:27:06 GMT; httponly; Path=/; secure, t=HlsAKcFI_HJWteio0_5ELyFf; Domain=dropbox.com; expires=Mon, 26 Apr 2021 17:27:06 GMT; httponly; Path=/; secure, __Host-js_csrf=HlsAKcFI_HJWteio0_5ELyFf; expires=Mon, 26 Apr 2021 17:27:06 GMT; Path=/; secure', 'X-Content-Type-Options': 'nosniff', 'X-Dropbox-Request-Id': 'b028e94ce7b814c7f25fb753449b641a', 'X-Frame-Options': 'DENY', 'X-Robots-Tag': 'noindex, nofollow, noimageindex', 'X-Xss-Protection': '1; mode=block', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Content-Encoding': 'gzip'}


2 Answers

The problem is that the file is being gzipped in transit even though it is already a gzipped file (you can see this from the 'Content-Encoding': 'gzip' field in r.headers).

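With both requests and wget, you are using the default request headers. By default, both will send something like 'Accept-Encoding: gzip, deflate' (you can see this if you print r.request.headers), so the server is free to gzip the file and send it back with a 'Content-Encoding: gzip' header.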
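As a quick check of the default headers mentioned above, the following sketch prints the Accept-Encoding value that requests attaches when you pass no headers of your own. It uses requests.utils.default_headers(); the exact list of encodings may vary with the installed requests and urllib3 versions, so treat the printed value as indicative only.

```python
import requests

# default_headers() returns the headers requests adds when none are supplied.
defaults = requests.utils.default_headers()
accept_encoding = defaults['Accept-Encoding']

# Typically includes gzip and deflate; other encodings depend on what is installed.
print(accept_encoding)
```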

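By default, both wget and requests will detect that header and transparently decode the data for you, but you have explicitly told requests not to do that and to read the raw data as is.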

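So the file you end up saving is a gzipped gzipped tarball. Obviously, file will report it as gzip compressed data, while tar will gunzip it and complain that the contents do not look like a tar archive, because they are not: they are a gzipped tar archive.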
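The double-compression effect described here can be reproduced offline with the standard library alone. This is a minimal sketch, not the actual Dropbox response: the archive is built in memory, the member name testfiles/a is borrowed from the question, and the helper names make_tar_gz and list_members are made up for illustration.

```python
import gzip
import io
import tarfile


def make_tar_gz():
    """Build a tiny .tar.gz in memory containing one member, testfiles/a."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        data = b'hello\n'
        info = tarfile.TarInfo(name='testfiles/a')
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(blob):
    """Return member names, or None if blob is not a readable gzipped tarball."""
    try:
        with tarfile.open(fileobj=io.BytesIO(blob), mode='r:gz') as tar:
            return tar.getnames()
    except tarfile.TarError:
        return None


original = make_tar_gz()
# Simulate a gzip Content-Encoding being applied on top of the .tar.gz.
double = gzip.compress(original)

print(list_members(original))                 # readable archive
print(list_members(double))                   # one gzip layer too many
print(list_members(gzip.decompress(double)))  # readable again after one extra gunzip
```

Both blobs start with the gzip magic bytes, which is why a type check alone cannot tell them apart.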

The minimal fix here is to manually add headers={'Accept-Encoding': 'identity'} to your request.
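Applied to the code from the question, that fix might look like the sketch below. The function name download_raw and the constant IDENTITY_HEADERS are made up for illustration, and the example.com URL is only a placeholder used to build a request without sending it. Whether the server honors the header is up to the server, so this is a sketch under that assumption rather than a guaranteed fix.

```python
import shutil

import requests

# Ask the server not to apply any content encoding to the response body.
IDENTITY_HEADERS = {'Accept-Encoding': 'identity'}


def download_raw(url, path_to_save):
    """Stream url to path_to_save, copying the raw response bytes unchanged."""
    with requests.get(url, headers=IDENTITY_HEADERS, stream=True) as r:
        r.raise_for_status()
        with open(path_to_save, 'wb') as f:
            shutil.copyfileobj(r.raw, f)


# Preparing a request without sending it shows the header value being set.
prepared = requests.Request(
    'GET', 'https://example.com/', headers=IDENTITY_HEADERS
).prepare()
print(prepared.headers['Accept-Encoding'])
```

It would be called as download_raw(url, "/tmp/data.dload-1") with the url from the question.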


You may be wondering why the server would bother gzipping an already gzipped file. Just because you told it you can accept gzip does not mean you need gzip, right?

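If you look at RFC 2616 and RFC 7231, the server should choose the encoding it can support that has the highest qvalue (weight), breaking ties by some unspecified heuristic. If your user agent explicitly asks for 'gzip, deflate', then giving you identity would be incorrect unless it is actually impossible to do otherwise, rather than just a bit silly.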

This is crazy, but changing the 0 at the end of the URL to 1 works. I got this from this SO post.
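Rather than editing the string by hand, the dl parameter can be rewritten with the standard library. This is a small sketch; the helper name with_dl_flag is made up for illustration. The assumption that dl=1 asks Dropbox for the file itself instead of a web page is based only on this answer and on the 'Content-Type': 'text/html' header shown in the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_dl_flag(url, value='1'):
    """Return url with its dl query parameter set to value."""
    parts = urlsplit(url)
    # Keep every other query parameter and drop any existing dl entry.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'dl']
    query.append(('dl', value))
    return urlunsplit(parts._replace(query=urlencode(query)))


url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
print(with_dl_flag(url))
```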
