How do I download a file returned "indirectly" by an HTML form submission? (Python, urllib, urllib2, etc.)
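Update: Solved. It turned out I had simply used "http:" instead of "https:" in the URL, a silly mistake. Still, the clean code example from cetver helped me track the problem down. Thanks to everyone who offered suggestions.
Pasting this URL into Firefox brings up the appropriate download / "Save As" dialog: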
https://www.virwox.com/orders.php?download_open=Download&format_open=.xls
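That link is equivalent to submitting the form with the "Download" button on the page at https://www.virwox.com/orders.php.
Here is the relevant HTML for the form that generates the URL above: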
<form action='orders.php' method='get'><fieldset><legend>Open Orders (2):</legend>
<input type='submit' value='Download' name='download_open' />
<select name='format_open'>
<option value='.xls'>.xls</option>
<option value='.csv'>.csv</option>
<option value='.xml'>.xml</option></select>
</form>
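Because the form declares method='get', a browser appends the field names and values to the action URL as a query string. A minimal sketch of that step, written for Python 3 (where urllib.parse.urlencode replaces urllib.urlencode); the helper name build_form_url is made up for illustration:

```python
from urllib.parse import urlencode, urljoin

def build_form_url(page_url, action, fields):
    """Return the URL a browser would request when submitting a GET form."""
    # Resolve the (possibly relative) form action against the page URL.
    target = urljoin(page_url, action)
    # GET forms carry their fields in the query string, not the request body.
    return target + "?" + urlencode(fields)

url = build_form_url(
    "https://www.virwox.com/orders.php",
    "orders.php",
    {"download_open": "Download", "format_open": ".xls"},
)
print(url)
```

The result matches the link pasted into Firefox above, which is why requesting that URL directly and pressing the button behave the same way.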
But when I try the following Python code (which I didn't really expect to work)...
# get orders list
openOrders_url = virwoxTopLevel_url+"/orders.php"
openOrders_params = urlencode( { "download_open":"Download", "format_open":".xml" } )
openOrders_request = urllib2.Request(openOrders_url,openOrders_params,headers)
openOrders_response = virwox_opener.open(openOrders_request)
openOrders_xml = openOrders_response.read()
print(openOrders_xml)
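...openOrders_xml just ends up being the original page (https://www.virwox.com/orders.php).
How does Firefox know that there is a file to download? And how can I detect and download that file in Python?
Note that this is not a security or login problem: if I had trouble authenticating, I would not be able to open the orders.php page at all.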
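A browser typically decides to offer a download based on the response headers, most commonly Content-Disposition: attachment together with a non-HTML Content-Type. Whether orders.php sends exactly these headers is an assumption; the sketch below (Python 3, hypothetical helper name describe_download, made-up sample header values) only shows how such headers could be inspected. In Python 3, response.headers from urllib.request is an email.message.Message subclass, so the same calls should work on a real response.

```python
from email.message import Message

def describe_download(headers):
    """Summarise the headers a browser looks at when offering a download."""
    return {
        # 'attachment', 'inline', or None when the header is absent
        "disposition": headers.get_content_disposition(),
        # filename parameter of Content-Disposition, or None
        "filename": headers.get_filename(),
        # media type without parameters such as charset
        "content_type": headers.get_content_type(),
    }

# Sample headers for illustration only; not captured from the real site.
file_headers = Message()
file_headers["Content-Type"] = "application/vnd.ms-excel"
file_headers["Content-Disposition"] = 'attachment; filename="orders.xls"'

page_headers = Message()
page_headers["Content-Type"] = "text/html; charset=utf-8"

print(describe_download(file_headers))
print(describe_download(page_headers))
```

If the response that comes back reports text/html and no disposition, the server returned the ordinary page rather than the export, which points at the request itself and not at how the body is read.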
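Update: I'm wondering whether this has something to do with redirects (I'm using the basic redirect handler), or whether I should be using something like urllib.urlretrieve().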
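One way to test the redirect theory is to compare the URL that was requested with the one reported by response.geturl() after the opener has followed any redirects. The sketch below is Python 3 and works on plain strings, so it runs without network access; diff_urls is a hypothetical helper, and the example URLs merely assume an http-to-https redirect.

```python
from urllib.parse import urlsplit

def diff_urls(requested, final):
    """List which URL components differ between the requested and final URL."""
    before, after = urlsplit(requested), urlsplit(final)
    changes = []
    for part in ("scheme", "netloc", "path", "query"):
        old, new = getattr(before, part), getattr(after, part)
        if old != new:
            changes.append((part, old, new))
    return changes

# In real use the second argument would be response.geturl().
changes = diff_urls(
    "http://www.virwox.com/orders.php",
    "https://www.virwox.com/orders.php",
)
for part, old, new in changes:
    print(part, repr(old), "->", repr(new))
```

This matters because urllib2's default HTTPRedirectHandler re-issues a POST that was redirected with 301, 302 or 303 as a GET without the form data, so a changed scheme or host is worth investigating.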
Update: here is the code of the full program, in case it's relevant...
import urllib
import urllib2
import cookielib
import pprint
from urllib import urlencode
username=###############
password=###############
virwoxTopLevel_url = "http://www.virwox.com/"
overview_url = "https://www.virwox.com/index.php"
# Header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
# Handlers...
# cookie handler...
cookie_handler= urllib2.HTTPCookieProcessor( cookielib.CookieJar() )
# redirect handler...
redirect_handler= urllib2.HTTPRedirectHandler()
# create "opener" (OpenerDirector instance)
virwox_opener = urllib2.build_opener(redirect_handler,cookie_handler)
# login
login_url = "https://www.virwox.com/index.php"
values = { 'uname' : username, 'password' : password }
login_data = urllib.urlencode(values)
login_request = urllib2.Request(login_url,login_data,headers)
login_response = virwox_opener.open(login_request)
overview_html = login_response.read();
virwox_json_url = "http://api.virwox.com/api/json.php"
getTest = urllib.urlencode( { "method":"getMarketDepth", "symbols[0]":"EUR/SLL","symbols[1]":"USD/SLL","buyDepth":1,"sellDepth":1,"id":1 } )
get_response = urllib2.urlopen(virwox_json_url,getTest)
#print get_response.read()
# get orders list
openOrders_url = virwoxTopLevel_url+"/orders.php"
openOrders_params = urlencode( { "download_open":"Download", "format_open":".xml" } )
openOrders_request = urllib2.Request(openOrders_url,openOrders_params,headers)
openOrders_response = virwox_opener.open(openOrders_request)
openOrders_xml = openOrders_response.read()
# the following prints the html of the /orders.php page not the desired download data:
print "******************************************"
print(openOrders_xml)
print "******************************************"
print openOrders_response.info()
print openOrders_response.geturl()
print "******************************************"
# the following prints nothing, I assume because without the cookie handler it fails to authenticate
# (note that authentication is done by the PHP program, not HTTP authentication, so there is no "authentication handler" above)
print urllib2.urlopen("https://www.virwox.com/orders.php?download_open=Download&format_open=.xml").read()
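Two details of the program above are easy to miss: concatenating "http://www.virwox.com/" with "/orders.php" yields a double slash, and passing the encoded parameters as the second argument of Request turns the request into a POST, whereas the form uses GET. A small Python 3 sketch (urllib.request is the successor of urllib2) that makes both visible without touching the network:

```python
import urllib.request
from urllib.parse import urlencode, urljoin

base = "http://www.virwox.com/"
params = urlencode({"download_open": "Download", "format_open": ".xml"})

# Plain concatenation keeps both slashes; urljoin resolves the path cleanly.
concatenated = base + "/orders.php"
joined = urljoin(base, "/orders.php")

# Data supplied -> POST with the parameters in the request body.
as_post = urllib.request.Request(joined, data=params.encode("ascii"))
# Parameters in the query string -> GET, like the browser's form submission.
as_get = urllib.request.Request(joined + "?" + params)

print(concatenated)
print(joined)
print(as_post.get_method(), as_get.get_method())
```

Neither detail is necessarily the cause of the problem, but both make the script's request differ from what the browser sends.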
3 Answers
0
You probably need something like urllib2.HTTPPasswordMgr
(I don't have your username and password, so I couldn't test this):
import urllib
import urllib2
# https, not http: per the update in the question, the plain-http URL was the culprit
uri = "https://www.virwox.com/"
url = uri + "orders.php"
uname = "USERNAME"
password = "PASSWORD"
post = urllib.urlencode({"download_open":"Download", "format_open":".xls"})
# realm=None only acts as a catch-all with the WithDefaultRealm variant
pwMgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
pwMgr.add_password(realm=None, uri=uri, user=uname, passwd=password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPDigestAuthHandler(pwMgr)))
# first request: pick up the session cookie
req = urllib2.Request(url, post)
s = urllib2.urlopen(req)
cookie = s.headers['Set-Cookie']
s.close()
# second request: resend with the cookie attached
req.add_header('Cookie', cookie)
s = urllib2.urlopen(req)
source = s.read()
s.close()
Then you can:
print source
to see whether it contains the XML data you need.
1
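It looks like your question has already been answered, but you might take a look at the Requests library. It is essentially a nice wrapper around the standard-library tools. The code below will (probably) do what you want.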
import requests
# params= puts the fields in the query string, matching the form's method='get'
r = requests.get('https://www.virwox.com/orders.php',
                 allow_redirects=True,
                 auth=('user', 'pass'),
                 params={'download_open': 'Download', 'format_open': '.xls'})
print r.content
1
The code below is untested.
Something like this:
import urllib, urllib2
HOST = 'https://www.virwox.com'
FORMS = {
    'login': {
        'action': HOST + '/index.php',
        'data': urllib.urlencode( {
            'uname':'username',
            'password':'******'
        } )
    },
    'orders': {
        'action': HOST + '/orders.php',
        'data': urllib.urlencode( {
            'download_open':'Download',
            'format_open':'.xml'
        } )
    }
}
# the cookie processor keeps the login session for the later requests
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor() )
try:
    req = urllib2.Request( url = FORMS['login']['action'], data = FORMS['login']['data'] )
    opener.open( req ) #save login cookie
    print 'Login: OK'
except Exception, e:
    print 'Login: Fail'
    print e
try:
    req = urllib2.Request( url = FORMS['orders']['action'], data = FORMS['orders']['data'] )
    print 'Orders Page: OK'
except Exception, e:
    print 'Orders Page: Fail'
    print e
try:
    xml = opener.open( req ).read()
    print xml
except Exception, e:
    print 'Obtain XML: Fail'
    print e
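For readers on Python 3, the same login-then-download flow could look roughly like the sketch below. It is an untested adaptation, not the answer's own code: the function names are invented here, the form field names are taken from the question, and the orders request is sent as a GET with a query string on the assumption that this mirrors what the browser does. Nothing is fetched until download_orders is called.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

HOST = "https://www.virwox.com"  # https, per the update in the question

def make_opener():
    """Opener that stores cookies, so the login session carries over."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def login_request(username, password):
    # Supplying data makes this a POST; Python 3 requires bytes here.
    data = urlencode({"uname": username, "password": password}).encode("ascii")
    return urllib.request.Request(HOST + "/index.php", data=data)

def orders_request(fmt=".xml"):
    # No data argument, so this is a GET with the fields in the query string.
    query = urlencode({"download_open": "Download", "format_open": fmt})
    return urllib.request.Request(HOST + "/orders.php?" + query)

def download_orders(opener, username, password, fmt=".xml"):
    """Log in, then fetch the export; returns the raw response bytes."""
    with opener.open(login_request(username, password)):
        pass  # only the session cookie matters here
    with opener.open(orders_request(fmt)) as response:
        return response.read()
```

A call such as download_orders(make_opener(), "USERNAME", "PASSWORD") would return bytes that can be written to a file opened in "wb" mode. The sketch does not verify that the login actually succeeded, so a real script would want to check the returned content before saving it.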