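I'm trying to scrape a web form and submit it with a POST request carrying the form data, then read the response. With Scrapy I get HTTP status 400, and with the requests module I get 404.
The overall goal is to download a PDF document that the browser displays once all the form details have been submitted successfully. I'm trying to automate that whole workflow.
The web form is at the URL used in the code below (https://ceo.maharashtra.gov.in/SearchList/). When you pick any value from a drop-down list, the form sends a POST request with some parameters to that same URL and gets back a response containing HTML. That HTML holds the values for the next drop-down field. So, basically, there are two dependent drop-down lists. After each selection, the page posts to itself, then refreshes and updates itself with the response it received.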
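The postback described above can be sketched offline. This is a minimal illustration, not the site's actual markup: `SAMPLE_HTML`, `HiddenFieldParser`, and `build_postback` are hypothetical names, and the field values are placeholders. It uses only the standard library so it runs without extra dependencies. The idea is that every hidden field from the previous response is echoed back, with `__EVENTTARGET` naming the drop-down that changed.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback(html, event_target, selections):
    """Build the form payload for one drop-down change (hypothetical helper)."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)            # echo back every hidden field
    payload["__EVENTTARGET"] = event_target  # the control that triggered the post
    payload.setdefault("__EVENTARGUMENT", "")
    payload.update(selections)               # the chosen drop-down values
    return payload


# Placeholder markup that only imitates the shape of an ASP.NET form.
SAMPLE_HTML = """
<form method="post" action="./">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <select name="ctl00$Content$DistrictList" id="Content_DistrictList">
    <option value="0">Select</option>
    <option value="1">District A</option>
  </select>
</form>
"""

payload = build_postback(
    SAMPLE_HTML,
    event_target="ctl00$Content$DistrictList",
    selections={"ctl00$Content$DistrictList": "1"},
)
print(payload)
```

Each response carries fresh `__VIEWSTATE` and `__EVENTVALIDATION` values, so the payload for the second drop-down would presumably have to be rebuilt from the HTML returned by the first POST rather than from the initial page.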
Here is my code using the requests module:
from bs4 import BeautifulSoup
import requests

url = 'https://ceo.maharashtra.gov.in/SearchList/'
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9,hi;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Content-Length": "2646",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": "ASP.NET_SessionId=wxfogxzgadg3gjxokbo0rcbn",
    "Host": "ceo.maharashtra.gov.in",
    "Origin": "https://ceo.maharashtra.gov.in",
    "Referer": "https://ceo.maharashtra.gov.in/SearchList/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
}

viewstate = soup.find('input', attrs={'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find('input', attrs={'id': '__EVENTVALIDATION'})['value']
districtList = soup.find('select', attrs={'id': 'Content_DistrictList'}).find_all('option')[1]['value']

data = {
    '__EVENTTARGET': 'ctl00$Content$DistrictList',
    '__EVENTARGUMENT': '',
    '__LASTFOCUS': '',
    '__VIEWSTATE': viewstate,
    '__EVENTVALIDATION': eventvalidation,
    'ctl00$Content$DistrictList': districtList,
    'ctl00$Content$AssemblyList': '0',
    'ctl00$Content$PartList': '0',
    'ctl00$Content$txtcaptcha': ''
}

response_1 = requests.post(url, params=data, headers=headers, verify=False)
print(response_1.text)
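One detail in the script above that may contribute to the 404: `params=` appends the fields to the URL as a query string, while a form submission normally sends them in the request body via `data=`. The sketch below prepares both variants without sending anything, to show where the fields end up. The field values are placeholders, and whether this alone fixes the real request is an assumption that has not been tested against the site.

```python
import requests

URL = "https://ceo.maharashtra.gov.in/SearchList/"

# Placeholder values; the real __VIEWSTATE is a long string taken from the page.
form = {
    "__EVENTTARGET": "ctl00$Content$DistrictList",
    "__VIEWSTATE": "abc123",
}

# Prepare (but do not send) the same POST in both styles.
as_query = requests.Request("POST", URL, params=form).prepare()
as_body = requests.Request("POST", URL, data=form).prepare()

print(as_query.url)   # fields are appended to the URL as a query string
print(as_query.body)  # no request body at all
print(as_body.url)    # URL is left unchanged
print(as_body.body)   # fields are form-encoded into the body
print(as_body.headers["Content-Length"])  # computed by requests from the body
```

Since requests works out `Content-Length` itself, a hard-coded value such as "2646" can disagree with the real body size. Likewise, a `requests.Session()` carries the cookies from the initial GET over to the POST, which may be more reliable than a fixed `Cookie` header copied from a browser.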
Here is my code using Scrapy:
# -*- coding: utf-8 -*-
import scrapy


class ElectionSpider(scrapy.Spider):
    name = 'election'
    allowed_domains = ['ceo.maharashtra.gov.in']
    start_urls = ['https://ceo.maharashtra.gov.in/SearchList/']
    dist_dict = []
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,hi;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Content-Length": "2646",
        "Content-Type": "application/x-www-form-urlencoded",
        "Cookie": "ASP.NET_SessionId=wxfogxzgadg3gjxokbo0rcbn",
        "Host": "ceo.maharashtra.gov.in",
        "Origin": "https://ceo.maharashtra.gov.in",
        "Referer": "https://ceo.maharashtra.gov.in/SearchList/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
    }

    def parse(self, response):
        for district in response.css('select#Content_DistrictList > option')[1:]:
            val = district.css('::attr(value)').extract_first()
            name = district.css('::text').extract_first()
            data = {
                '__EVENTTARGET': response.css('select#Content_DistrictList::attr(name)').extract_first(),
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
                '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                'ctl00$Content$DistrictList': val,
                'ctl00$Content$AssemblyList': '0',
                'ctl00$Content$PartList': '0',
                'ctl00$Content$txtcaptcha': ''
            }
            meta = {'handle_httpstatus_all': True}
            print(data)
            yield scrapy.FormRequest(url=self.start_urls[0], method='POST', headers=self.headers, formdata=data, meta=meta, callback=self.parse_assembly)
            break

    def parse_assembly(self, response):
        print(response.text)
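Now, when I send the POST request with requests and with Scrapy, I get HTTP errors 404 and 400 respectively. I can't find a way around this.
So please help me. Thanks.
Note: I can't use any browser automation tool such as Selenium for this task.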