How do I get the real file URL in Python 2.7?

Published 2024-06-02 04:24:46


I have a URL, http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip, that "redirects" me to http://images.vbb.de/assets/ftp/file/286316.zip. I put "redirects" in quotes because Python says there is no redirect:

    In [51]: response = requests.get('http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
        ...: if response.history:
        ...:     print "Request was redirected"
        ...:     for resp in response.history:
        ...:         print resp.status_code, resp.url
        ...:     print "Final destination:"
        ...:     print response.status_code, response.url
        ...: else:
        ...:     print "Request was not redirected"
        ...:     
    Request was not redirected

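The status code is also 200, and response.history is empty. response.url gives the first URL, not the real one. In Firefox -> Developer Tools -> Network, however, the real URL is visible. How can I get it in Python 2.7? Thanks in advance!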


Tags: http, url, response, request, www, de, zip, resp
2 answers

You can use BeautifulSoup to read the meta tag in the HTML page's header and get the redirect URL:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> a = requests.get("http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip")
>>> soup = BeautifulSoup(a.text, 'html.parser')
>>> soup.find_all('meta', attrs={'http-equiv': lambda x: x and x.lower() == 'refresh'})[0]['content'].split('URL=')[1]
'/de/download/GTFS_VBB_Nov2015_Dez2016.zip'

This URL is relative to the domain of the original URL, which makes the new URL http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip. Downloading from it gave me the ZIP file:

 $ unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     5554  2015-11-20 15:17   agency.txt
  2151517  2015-11-20 15:17   calendar_dates.txt
    71731  2015-11-20 15:17   calendar.txt
    65424  2015-11-20 15:17   routes.txt
   816498  2015-11-20 15:17   stops.txt
196020096  2015-11-20 15:17   stop_times.txt
   365499  2015-11-20 15:17   transfers.txt
 11765292  2015-11-20 15:17   trips.txt
      113  2015-11-20 15:17   logging
---------                     -------
211261724                     9 files
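
Rather than joining the domain and the relative path by hand, the target can be resolved against the page's own URL with urljoin. What follows is only a minimal sketch: `meta_refresh_target` is a hypothetical helper name, it assumes the page carries a meta refresh tag with `http-equiv` written before `content`, and it swaps BeautifulSoup for a regular expression so that it has no third-party dependency. The sample HTML is made up to mirror the shape of the real page.

```python
import re

try:  # Python 3
    from urllib.parse import urljoin
except ImportError:  # Python 2.7
    from urlparse import urljoin

# Captures the content attribute of <meta http-equiv="refresh" content="...">.
# Assumption: http-equiv comes before content inside the tag.
_META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*content=["\']([^"\']*)["\']',
    re.IGNORECASE)


def meta_refresh_target(html, page_url):
    """Return the absolute URL a meta refresh tag points to, or None."""
    match = _META_REFRESH.search(html)
    if match is None:
        return None
    content = match.group(1)  # e.g. "0; URL=/de/download/file.zip"
    parts = re.split(r'url\s*=', content, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) != 2:
        return None
    # urljoin handles both relative and already-absolute targets.
    return urljoin(page_url, parts[1].strip())


# Made-up page body, shaped like the one the first request returns.
sample_html = ('<html><head><meta http-equiv="Refresh" '
               'content="0; URL=/de/download/GTFS_VBB_Nov2015_Dez2016.zip">'
               '</head></html>')
resolved = meta_refresh_target(
    sample_html, 'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')
```

With the sample above, `resolved` is the absolute /de/download/ URL on www.vbb.de, which can then be passed to requests.get as usual.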

The request to this second URL is answered with a 301 redirect, which requests follows:

>>> a.history
[<Response [301]>]
>>> a
<Response [200]>
>>> a.history[0]
<Response [301]>
>>> a.history[0].url
'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
>>> a.url
'http://images.vbb.de/assets/ftp/file/286316.zip'
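
The history attribute only records HTTP-level redirects (3xx responses); a redirect done through a meta refresh tag or JavaScript arrives as an ordinary 200 page, which is why the request in the question reports no redirect. As a small sketch, the hops of a response can be flattened into one list for inspection. `redirect_chain` is a hypothetical helper, and `FakeResponse` is a stand-in that only mimics the three attributes used here so that the example runs without network access.

```python
from collections import namedtuple


def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, final response last."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in with the same attribute names as requests.Response (illustration only).
FakeResponse = namedtuple('FakeResponse', 'status_code url history')

first = FakeResponse(
    301, 'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip', [])
final = FakeResponse(
    200, 'http://images.vbb.de/assets/ftp/file/286316.zip', [first])
chain = redirect_chain(final)
```

Passing a real response from requests.get works the same way, since it exposes status_code, url and history.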

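You first need to perform the redirect manually by parsing the new window.location.href out of the HTML that the first request returns. That second request is then answered with a 301 reply, whose Location header contains the name of the target file: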

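import re
import requests

base_url = 'http://www.vbb.de'

# The first response is an HTML page that redirects via JavaScript.
response = requests.get(base_url + '/de/datei/GTFS_VBB_Nov2015_Dez2016.zip')

# Pull the relative target out of the window.location.href assignment.
manual_redirect = base_url + re.findall(r'window\.location\.href\s+=\s+"(.*?)"', response.text)[0]

# This request is answered with a 301, which requests follows automatically.
response = requests.get(manual_redirect, stream=True)

# The Location header of the 301 ends with the real file name.
target_filename = response.history[0].headers['Location'].split('/')[-1]

print "Downloading: '{}'".format(target_filename)
with open(target_filename, 'wb') as f_zip:
    # Stream to disk in 1 KiB chunks instead of loading the file into memory.
    for chunk in response.iter_content(chunk_size=1024):
        f_zip.write(chunk)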

This prints the name of the file being downloaded and produces a zip file of 29,464,299 bytes.
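
If a site chains several script-based redirects, the manual step can be wrapped in a loop. The sketch below is an assumption-laden illustration, not part of the answer: `js_redirect_target` and `resolve_real_url` are hypothetical names, the pattern is a slightly more permissive variant of the window.location.href regex, and `fetch` is an injected callable returning `(final_url, body_text)` so that the logic can be exercised without network access. With requests, `fetch` would wrap requests.get and should avoid reading the body of non-HTML responses such as the zip itself.

```python
import re

try:  # Python 3
    from urllib.parse import urljoin
except ImportError:  # Python 2.7
    from urlparse import urljoin

# Matches: window.location.href = "..."
_JS_REDIRECT = re.compile(r'window\.location\.href\s*=\s*"(.*?)"')


def js_redirect_target(html, page_url):
    """Return the absolute URL assigned to window.location.href, or None."""
    match = _JS_REDIRECT.search(html)
    if match is None:
        return None
    return urljoin(page_url, match.group(1))


def resolve_real_url(url, fetch, max_hops=5):
    """Follow script redirects until a page without one is reached.

    fetch(url) must return (final_url, body_text), where final_url already
    reflects any HTTP 3xx redirects the client followed.
    """
    for _ in range(max_hops):
        final_url, body = fetch(url)
        target = js_redirect_target(body, final_url)
        if target is None:
            return final_url
        url = target
    raise RuntimeError('too many client-side redirects')


# Fake site mirroring the two hops described above (illustration only).
start = 'http://www.vbb.de/de/datei/GTFS_VBB_Nov2015_Dez2016.zip'
download = 'http://www.vbb.de/de/download/GTFS_VBB_Nov2015_Dez2016.zip'
pages = {
    start: (start, '<script>window.location.href = '
                   '"/de/download/GTFS_VBB_Nov2015_Dez2016.zip";</script>'),
    download: ('http://images.vbb.de/assets/ftp/file/286316.zip', ''),
}
real_url = resolve_real_url(start, pages.__getitem__)
```

The `max_hops` limit is there so that two pages redirecting to each other cannot loop forever.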
