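Sending an ASP.net POST request with Python's Requests
I'm scraping an old ASP.net website using Python's requests module.
I've spent more than 5 hours trying to work out how to simulate this POST request, without success. Doing it the way shown below, I basically get a message saying "No item matches this item reference."
Any help would be greatly appreciated. Here are the request and my code; some details have been changed for brevity and privacy:
My own code: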
import requests
# Scraping the item number from the website, I have confirmed this is working.
#Then use the newly acquired item number to request the data.
item_url = "http://www.example.com/EN/items/Pages/yourrates.aspx?vr=" + item_number[0]
viewstate = r'/wEPD...' # Truncated for brevity.
# Create the appropriate request and payload.
payload = {"vr": int(item_number[0])}
item_request_body = {
"__SPSCEditMenu": "true",
"MSOWebPartPage_PostbackSource": "",
"MSOTlPn_SelectedWpId": "",
"MSOTlPn_View": 0,
"MSOTlPn_ShowSettings": "False",
"MSOGallery_SelectedLibrary": "",
"MSOGallery_FilterString": "",
"MSOTlPn_Button": "none",
"__EVENTTARGET": "",
"__EVENTARGUMENT": "",
"MSOAuthoringConsole_FormContext": "",
"MSOAC_EditDuringWorkflow": "",
"MSOSPWebPartManager_DisplayModeName": "Browse",
"MSOWebPartPage_Shared": "",
"MSOLayout_LayoutChanges": "",
"MSOLayout_InDesignMode": "",
"MSOSPWebPartManager_OldDisplayModeName": "Browse",
"MSOSPWebPartManager_StartWebPartEditingName": "false",
"__VIEWSTATE": viewstate,
"keywords": "Search our site",
"__CALLBACKID": "ctl00$SPWebPartManager1$g_dbb9e9c7_fe1d_46df_8789_99a6c9db4b22",
"__CALLBACKPARAM": "startvr"
}
# Write the appropriate headers for the property information.
item_request_headers = {
"Host": home_site,
"Connection": "keep-alive",
"Content-Length": len(encoded_valuation_request),
"Cache-Control": "max-age=0",
"Origin": home_site,
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie": "__utma=48409910.1174413745.1405662151.1406402487.1406407024.17; __utmb=48409910.7.10.1406407024; __utmc=48409910; __utmz=48409910.1406178827.13.3.utmcsr=ratesandvallandingpage|utmccn=landingpages|utmcmd=button",
"Accept": "*/*",
"Referer": valuation_url,
"Accept-Encoding": "gzip,deflate,sdch",
"Accept-Language": "en-US,en;q=0.8"
}
response = requests.post(url=item_url, params=payload, data=item_request_body, headers=item_request_headers)
print(response.text)
This is what the request looks like according to Chrome:
Remote Address:202.55.96.131:80
Request URL:http://www.example.com/EN/items/Pages/yourrates.aspx?vr=123456789
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:21501
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:__utma=48409910.1174413745.1405662151.1406402487.1406407024.17; __utmb=48409910.7.10.1406407024; __utmc=48409910; __utmz=48409910.1406178827.13.3.utmcsr=ratesandvallandingpage|utmccn=landingpages|utmcmd=button
Host:www.site.com
Origin:www.site.com
Referer:http://www.example.com/EN/items/Pages/yourrates.aspx?vr=123456789
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Query String Parameters
vr:123456789
Form Data
__SPSCEditMenu:true
MSOWebPartPage_PostbackSource:
MSOTlPn_SelectedWpId:
MSOTlPn_View:0
MSOTlPn_ShowSettings:False
MSOGallery_SelectedLibrary:
MSOGallery_FilterString:
MSOTlPn_Button:none
__EVENTTARGET:
__EVENTARGUMENT:
MSOAuthoringConsole_FormContext:
MSOAC_EditDuringWorkflow:
MSOSPWebPartManager_DisplayModeName:Browse
MSOWebPartPage_Shared:
MSOLayout_LayoutChanges:
MSOLayout_InDesignMode:
MSOSPWebPartManager_OldDisplayModeName:Browse
MSOSPWebPartManager_StartWebPartEditingName:false
__VIEWSTATE:/wEPD...(Omitted for length)
keywords:Search our site
__CALLBACKID:ctl00$SPWebPartManager1$g_dbb9e9c7_fe1d_46df_8789_99a6c9db4b22
__CALLBACKPARAM:startvr
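2 Answers
Although this relates to the question title, it doesn't exactly match the asker's situation. I'd like to add one more useful tip on top of Martijn's answer; it is mostly general advice about making POST requests with the requests library.
Inspecting the request payload in the browser (for example in the Network tab of Chrome's developer tools) may reveal that some keys/fields appear more than once in the payload.
Example request payload from Chrome: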
...
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "AcceptedModified",
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "InvoiceFullyDisputed",
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "DisputedItemsClosed",
...
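If you copy the browser's request payload verbatim and try to reproduce it exactly in the payload/data argument of requests, it won't work (or at least won't give the result you expect; you may still get a 200 response). A Python dict cannot hold duplicate keys, so only the value of the last key/field gets sent.
Requests data/payload that will NOT work (or at least won't give the result you expect):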
payload = {
...
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "AcceptedModified",
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "InvoiceFullyDisputed",
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "DisputedItemsClosed",
...
}
r = session.post(url, headers=headers, data=payload)
Instead, put the multiple values for that key/field into a list:
Requests data/payload that WILL work (or give the result you expect):
payload = {
...
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": ["AcceptedModified", "InvoiceFullyDisputed", "DisputedItemsClosed"],
...
}
r = session.post(url, headers=headers, data=payload)
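The reason the first form fails is plain Python behaviour rather than anything in requests or ASP.NET. Here is a minimal sketch of that, reusing the field name from the example above:

```python
# Field name taken from the example above.
FIELD = "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus"

# A dict literal with a repeated key silently keeps only the last assignment.
collapsed = {
    FIELD: "AcceptedModified",
    FIELD: "InvoiceFullyDisputed",
    FIELD: "DisputedItemsClosed",
}

# Putting the values in a list keeps all three under a single key.
as_list = {
    FIELD: ["AcceptedModified", "InvoiceFullyDisputed", "DisputedItemsClosed"],
}

print(len(collapsed), collapsed[FIELD])
print(len(as_list[FIELD]))
```

So by the time requests sees the first payload, two of the three values are already gone.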
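...It took me hours to realise this. I dug deep into the mechanics of ASP.NET sites, thinking there was something there I needed to understand. There wasn't. So I just want to save others some time; I hope it helps.
Thanks to this Stack Overflow question, which made me realise this.
Note: you can check exactly what you sent by looking at r.request.body on the response object (r in this example). That is how I noticed my request was missing some information (namely the repeated fields/keys).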
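If you'd rather check the body before anything goes over the wire, one option is to build a prepared request and look at its body; as far as I know it is the same string r.request.body shows after a real POST. A rough sketch, where the URL and the extra __EVENTTARGET field are placeholders of mine and not from the site in question:

```python
import requests
from urllib.parse import parse_qs

FIELD = "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus"
payload = {
    "__EVENTTARGET": "",
    FIELD: ["AcceptedModified", "InvoiceFullyDisputed", "DisputedItemsClosed"],
}

# Build the request without sending it; the URL is a placeholder.
prepared = requests.Request(
    "POST", "http://www.example.com/invoices.aspx", data=payload
).prepare()

# The form-encoded body, with the list expanded into repeated keys.
print(prepared.body)

# Decode it again to confirm every value of the repeated field is present.
decoded = parse_qs(prepared.body, keep_blank_values=True)
print(decoded[FIELD])
```

Nothing is sent here, so it is safe to run while you are still experimenting with the payload.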
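Your request carries too many parameters, and you shouldn't set the Content-Type, Content-Length, Host, Origin or Connection headers yourself; leave those for requests to handle.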
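To illustrate the header point for the two body-related headers, here is a small sketch that prepares a form POST without setting any headers by hand. The single form field is just a stand-in for the full body:

```python
import requests

# Prepare (but do not send) a form POST with no hand-written headers.
prepared = requests.Request(
    "POST",
    "http://www.example.com/EN/items/Pages/yourrates.aspx",
    data={"__CALLBACKPARAM": "startvr"},
).prepare()

# requests fills these in from the encoded body on its own.
print(prepared.headers["Content-Type"])
print(prepared.headers["Content-Length"])
print(prepared.body)
```

A hand-written Content-Length that doesn't match the real body is an easy way to get a broken request, which is one more reason to leave it alone.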
You are also doubling up on the URL parameters; either add the vr parameter to the URL manually or use params, not both.
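You can see the doubling by preparing the request both ways and printing the resulting URL. This is only a sketch; the item number is the sample value from the Chrome capture in the question:

```python
import requests

base = "http://www.example.com/EN/items/Pages/yourrates.aspx"
item_number = "123456789"  # sample value from the question's capture

# vr in the URL *and* in params: requests appends params to the existing query.
doubled = requests.Request(
    "POST", base + "?vr=" + item_number, params={"vr": item_number}
).prepare()

# vr only in params: it appears once.
clean = requests.Request("POST", base, params={"vr": item_number}).prepare()

print(doubled.url)
print(clean.url)
```

Whether the server tolerates the repeated vr is up to the application, so it is simpler not to send it twice.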
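Some of the parameters in the POST body may be generated by the ASP application and tied to a session. I'd do a GET request against valuation_url with a Session object and parse the form on that page to extract the __CALLBACKID parameter. That way the requests Session stores any cookies the server sets and reuses them: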
import requests
from bs4 import BeautifulSoup

# item_number, valuation_url and viewstate come from your own code.
item_request_headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
"Accept": "*/*",
"Accept-Encoding": "gzip,deflate,sdch",
"Accept-Language": "en-US,en;q=0.8"
}
payload = {"vr": int(item_number[0])}
# Session() takes no arguments; set the default headers on the instance instead.
session = requests.Session()
session.headers.update(item_request_headers)
# Get form page
form_response = session.get(valuation_url, params=payload)
# parse form page; BeautifulSoup could do this for example
soup = BeautifulSoup(form_response.content, 'html.parser')
callbackid = soup.select('input[name=__CALLBACKID]')[0]['value']
item_request_body = {
"__SPSCEditMenu": "true",
"MSOWebPartPage_PostbackSource": "",
"MSOTlPn_SelectedWpId": "",
"MSOTlPn_View": 0,
"MSOTlPn_ShowSettings": "False",
"MSOGallery_SelectedLibrary": "",
"MSOGallery_FilterString": "",
"MSOTlPn_Button": "none",
"__EVENTTARGET": "",
"__EVENTARGUMENT": "",
"MSOAuthoringConsole_FormContext": "",
"MSOAC_EditDuringWorkflow": "",
"MSOSPWebPartManager_DisplayModeName": "Browse",
"MSOWebPartPage_Shared": "",
"MSOLayout_LayoutChanges": "",
"MSOLayout_InDesignMode": "",
"MSOSPWebPartManager_OldDisplayModeName": "Browse",
"MSOSPWebPartManager_StartWebPartEditingName": "false",
"__VIEWSTATE": viewstate,
"keywords": "Search our site",
"__CALLBACKID": callbackid,
"__CALLBACKPARAM": "startvr"
}
item_url = 'http://www.example.com/EN/items/Pages/yourrates.aspx'
response = session.post(url=item_url, params=payload, data=item_request_body,
headers={'Referer': form_response.url})
The session takes care of the headers (it sets the User-Agent and Accept headers); only on the POST made through the session do we add a Referer header.
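The code above still passes a hard-coded viewstate. On ASP.NET pages __VIEWSTATE is normally emitted as a hidden input, so it may be worth reading it from the page you just fetched, together with any other hidden fields. A rough standard-library sketch follows; the HiddenInputCollector class, the hidden_fields helper and the sample HTML are all made up for illustration, and I haven't checked what the real page contains:

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical markup standing in for form_response.text.
sample_html = """
<form method="post" action="yourrates.aspx?vr=123456789">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPD..." />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="text" name="keywords" value="Search our site" />
</form>
"""

fields = hidden_fields(sample_html)
print(fields)
```

With the real page you would call hidden_fields(form_response.text) and merge the result into item_request_body before adding the callback parameters.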