I have a strange error. There is a file on Dropbox that I am downloading with the following Python code:
import requests
import shutil
url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
r = requests.get(url, stream=True)
path_to_save = "/tmp/data.dload-1"
with open(path_to_save, 'wb') as f:
    shutil.copyfileobj(r.raw, f)
This downloads the file to /tmp/data.dload-1.
I also downloaded the same file with wget:
wget https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0 -O /tmp/data.dload-2
The two files have the same type, but untarring them produces different results:
(dl)x:x$ tar -zxvf /tmp/data.dload-1
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(dl) x:x$ tar -zxvf /tmp/data.dload-2
testfiles/a
testfiles/b
(dl)x:x$
Does anyone know why this happens and, more importantly, how I can download the tar file with Python (preferably with requests)?
Here is the result of r.headers:
(dl) x:x$ python dload-test.py
{'Server': 'nginx', 'Date': 'Fri, 27 Apr 2018 17:27:06 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'no-cache', 'Content-Security-Policy': "script-src 'unsafe-eval' https://www.dropbox.com/static/compiled/js/ https://www.dropbox.com/static/javascript/ https://www.dropbox.com/static/api/ https://cfl.dropboxstatic.com/static/compiled/js/ https://www.dropboxstatic.com/static/compiled/js/ https://cfl.dropboxstatic.com/static/js/ https://www.dropboxstatic.com/static/js/ https://cfl.dropboxstatic.com/static/previews/ https://www.dropboxstatic.com/static/previews/ https://cfl.dropboxstatic.com/static/api/ https://www.dropboxstatic.com/static/api/ https://cfl.dropboxstatic.com/static/cms/ https://www.dropboxstatic.com/static/cms/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ 'unsafe-inline' ; img-src https://* data: blob: ; frame-ancestors 'self' ; default-src 'none' ; frame-src https://* carousel://* dbapi-6://* dbapi-7://* dbapi-8://* itms-apps://* itms-appss://* ; worker-src https://www.dropbox.com/static/serviceworker/ blob: ; style-src https://* 'unsafe-inline' 'unsafe-eval' ; connect-src https://* ws://127.0.0.1:*/ws ; object-src 'self' https://cfl.dropboxstatic.com/static/ https://www.dropboxstatic.com/static/ https://flash.dropboxstatic.com https://swf.dropboxstatic.com https://dbxlocal.dropboxstatic.com ; media-src https://* blob: ; font-src https://* data: ; child-src https://www.dropbox.com/static/serviceworker/ blob: ; form-action 'self' https://www.dropbox.com/ https://dl-web.dropbox.com/ https://photos.dropbox.com/ https://accounts.google.com/ https://api.login.yahoo.com/ https://login.yahoo.com/ ; base-uri 'self' api-stream.dropbox.com showbox-tr.dropbox.com ; report-uri https://www.dropbox.com/csp_log", 'Dropbox-Streaming': 'V=1', 'Pragma': 'no-cache', 'Referrer-Policy': 'origin-when-cross-origin', 'Set-Cookie': 'locale=en; 
Domain=dropbox.com; expires=Wed, 26 Apr 2023 17:27:06 GMT; Path=/; secure, gvc=OTU0NjExNzUwNjc0NjQxNzgwMzE0OTgzMzkzNjc3MzM5OTYzNzc%3D; expires=Wed, 26 Apr 2023 17:27:06 GMT; httponly; Path=/; secure, flash=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, puc=; expires=Fri, 27 Apr 2018 17:27:06 GMT; httponly; Path=/; secure, bang=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, seen-sl-signup-modal=VHJ1ZQ%3D%3D; expires=Sun, 27 May 2018 17:27:06 GMT; httponly; Path=/; secure, t=HlsAKcFI_HJWteio0_5ELyFf; Domain=dropbox.com; expires=Mon, 26 Apr 2021 17:27:06 GMT; httponly; Path=/; secure, __Host-js_csrf=HlsAKcFI_HJWteio0_5ELyFf; expires=Mon, 26 Apr 2021 17:27:06 GMT; Path=/; secure', 'X-Content-Type-Options': 'nosniff', 'X-Dropbox-Request-Id': 'b028e94ce7b814c7f25fb753449b641a', 'X-Frame-Options': 'DENY', 'X-Robots-Tag': 'noindex, nofollow, noimageindex', 'X-Xss-Protection': '1; mode=block', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Content-Encoding': 'gzip'}
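The problem is that the file is being gzipped in transit, even though it is already a gzipped file (you can see this from the 'Content-Encoding': 'gzip' field in r.headers).
For both wget and requests, you are using the default request headers. By default, both send something like 'Accept-Encoding: gzip, deflate' (you can see this if you print r.request.headers), so the server is free to gzip the file and send it back with a 'Content-Encoding: gzip' header.
By default, both wget and requests detect that header and transparently decode the data for you, but by reading r.raw you have explicitly told requests not to do that and to hand you the raw data as-is.
So what ends up being saved is a gzipped gzipped tarball. Naturally, file reports it as gzip compressed data, while tar complains that what is inside the gzip "does not look like a tar archive", because it isn't: it is a gzip-compressed tar archive.
The smallest fix here is to add headers={'Accept-Encoding': 'identity'} to your request.
You may wonder why the server would bother to gzip an already gzipped file. Just because you tell it you can accept gzip doesn't mean you need gzip, right?
If you look at RFC 2616 and RFC 7231, the server is supposed to choose the encoding it can support that has the highest qvalue (weight), breaking ties by some unspecified heuristic. If your user agent explicitly asks for 'gzip, deflate', then giving you identity would be incorrect unless compressing were actually impossible, rather than just a bit silly.

This is crazy, but changing the 0 at the end of the URL to 1 works. This comes from another SO post.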
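The two suggestions above (asking for the identity encoding, and switching dl=0 to dl=1) could be combined roughly as follows. This is only a sketch: the function names force_direct_download and download are made up for illustration, and setting r.raw.decode_content = True is an extra safeguard that neither answer mentions. It assumes a requests version recent enough to use a response as a context manager.

```python
import shutil
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_direct_download(url):
    """Return `url` with its `dl` query parameter set to 1 (hypothetical helper)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


def download(url, path_to_save):
    """Stream `url` to `path_to_save` without a Content-Encoding wrapper."""
    # Third-party dependency, imported here so the URL helper above
    # can be used without requests installed.
    import requests

    # Ask the server not to apply any content coding to the body.
    headers = {"Accept-Encoding": "identity"}
    with requests.get(url, headers=headers, stream=True) as r:
        r.raise_for_status()
        # If the server compresses anyway, let urllib3 undo it while reading.
        r.raw.decode_content = True
        with open(path_to_save, "wb") as f:
            shutil.copyfileobj(r.raw, f)
```

With the URL from the question, this would be called as download(force_direct_download(url), "/tmp/data.dload-1"). Whether Dropbox still serves the archive directly for dl=1 is not something this sketch can guarantee.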
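To see why tar rejects the first file even though both files look like gzip data, the double compression described in the first answer can be reproduced offline. This is a minimal sketch using only the standard library. The helper names make_tar_gz and list_tar_gz and the file contents are invented for illustration, and gzip.compress merely stands in for a server applying Content-Encoding: gzip.

```python
import gzip
import io
import tarfile


def make_tar_gz(files):
    """Build a .tar.gz archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_tar_gz(blob):
    """Return the member names of a .tar.gz blob held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return tar.getnames()


archive = make_tar_gz({"testfiles/a": b"a\n", "testfiles/b": b"b\n"})
# Simulate a server wrapping the archive in Content-Encoding: gzip.
on_the_wire = gzip.compress(archive)

# Both blobs start with the gzip magic bytes, so both look like gzip data.
assert archive[:2] == on_the_wire[:2] == b"\x1f\x8b"

try:
    list_tar_gz(on_the_wire)  # raw body: gzip around a .tar.gz
except tarfile.ReadError as exc:
    print("raw body:", exc)

# Removing the outer gzip layer leaves the real archive.
print("decoded body:", list_tar_gz(gzip.decompress(on_the_wire)))
```

The raw body fails because one round of decompression yields another gzip stream instead of a tar header, which matches the tar error shown in the question.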