HTTP "400" or "404" error when submitting a form with Scrapy or the requests module

Posted 2024-05-14 08:58:48


I am trying to scrape a web form and submit it with a POST request carrying the form data, then read the response. With Scrapy I get HTTP error code "400", and with the requests module I get a "404" error.

Overall, I am trying to download a PDF document that the browser displays after all the form details have been submitted successfully, and I want to automate that whole workflow.

The URL of the web form is https://ceo.maharashtra.gov.in/SearchList/ (used in the code below). When you select a value from the first dropdown, the form issues a POST request with some parameters to the same URL and returns a response containing HTML. That HTML holds the values for the next dropdown field. So there are basically two dependent dropdowns: after each selection the page POSTs back to itself and refreshes with the response it gets.
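For context, this is the classic ASP.NET WebForms postback: the browser re-submits the whole form with `__EVENTTARGET` naming the control that changed, plus the server-issued `__VIEWSTATE` and `__EVENTVALIDATION` tokens. A rough sketch of the URL-encoded body the browser sends on a district change (the token values here are placeholders, not real data):

```python
from urllib.parse import urlencode

# Placeholder payload for the district-dropdown postback; the real
# __VIEWSTATE / __EVENTVALIDATION tokens come from hidden inputs on the page.
payload = {
    '__EVENTTARGET': 'ctl00$Content$DistrictList',  # control that fired the postback
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': 'VIEWSTATE_TOKEN',
    '__EVENTVALIDATION': 'EVENTVALIDATION_TOKEN',
    'ctl00$Content$DistrictList': '1',              # selected district value
}

body = urlencode(payload)  # the application/x-www-form-urlencoded POST body
print(body)
```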

Here is my code using the requests module:

from bs4 import BeautifulSoup
import requests

url = 'https://ceo.maharashtra.gov.in/SearchList/'

response = requests.get(url, verify=False)

soup = BeautifulSoup(response.text, 'html.parser')

headers = {    
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9,hi;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Content-Length": "2646",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": "ASP.NET_SessionId=wxfogxzgadg3gjxokbo0rcbn",
    "Host": "ceo.maharashtra.gov.in",
    "Origin": "https://ceo.maharashtra.gov.in",
    "Referer": "https://ceo.maharashtra.gov.in/SearchList/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
}

viewstate = soup.find('input', attrs={'id' : '__VIEWSTATE'})['value']
eventvalidation = soup.find('input', attrs={'id' : '__EVENTVALIDATION'})['value']

districtList = soup.find('select', attrs={'id' : 'Content_DistrictList'}).find_all('option')[1]['value']

data = {
    '__EVENTTARGET' : 'ctl00$Content$DistrictList',
    '__EVENTARGUMENT' : '',
    '__LASTFOCUS' : '', 
    '__VIEWSTATE' : viewstate,
    '__EVENTVALIDATION' : eventvalidation,
    'ctl00$Content$DistrictList' : districtList,
    'ctl00$Content$AssemblyList' : '0',
    'ctl00$Content$PartList' : '0',
    'ctl00$Content$txtcaptcha' : ''
}

response_1 = requests.post(url, params=data, headers=headers, verify=False)

print(response_1.text)
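For reference, here is the same payload-building step factored into a helper, so the hidden tokens always come from the page that was just fetched (`build_postback` is my own name, not part of any library; the commented-out session round trip is just the pattern I am aiming for):

```python
from bs4 import BeautifulSoup


def build_postback(html, district_value):
    """Rebuild the WebForms payload from the page's hidden token fields."""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        '__EVENTTARGET': 'ctl00$Content$DistrictList',
        '__EVENTARGUMENT': '',
        '__LASTFOCUS': '',
        '__VIEWSTATE': soup.find('input', attrs={'id': '__VIEWSTATE'})['value'],
        '__EVENTVALIDATION': soup.find('input', attrs={'id': '__EVENTVALIDATION'})['value'],
        'ctl00$Content$DistrictList': district_value,
        'ctl00$Content$AssemblyList': '0',
        'ctl00$Content$PartList': '0',
        'ctl00$Content$txtcaptcha': '',
    }


# Intended round trip (not executed here), keeping one session so the
# ASP.NET_SessionId cookie that issued the tokens is the one that posts them:
# import requests
# with requests.Session() as s:
#     page = s.get('https://ceo.maharashtra.gov.in/SearchList/', verify=False)
#     resp = s.post(page.url, data=build_postback(page.text, '1'), verify=False)
```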

And here is my code using Scrapy:

# -*- coding: utf-8 -*-
import scrapy


class ElectionSpider(scrapy.Spider):
    name = 'election'
    allowed_domains = ['ceo.maharashtra.gov.in']
    start_urls = ['https://ceo.maharashtra.gov.in/SearchList/']
    dist_dict = []

    headers = {    
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,hi;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Content-Length": "2646",
        "Content-Type": "application/x-www-form-urlencoded",
        "Cookie": "ASP.NET_SessionId=wxfogxzgadg3gjxokbo0rcbn",
        "Host": "ceo.maharashtra.gov.in",
        "Origin": "https://ceo.maharashtra.gov.in",
        "Referer": "https://ceo.maharashtra.gov.in/SearchList/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"
    }

    def parse(self, response):
        for district in response.css('select#Content_DistrictList > option')[1:]:
            val = district.css('::attr(value)').extract_first()
            name = district.css('::text').extract_first()

            data = {
                '__EVENTTARGET' : response.css('select#Content_DistrictList::attr(name)').extract_first(),
                '__EVENTARGUMENT' : '',
                '__LASTFOCUS' : '', 
                '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                'ctl00$Content$DistrictList' : val,
                'ctl00$Content$AssemblyList' : '0',
                'ctl00$Content$PartList' : '0',
                'ctl00$Content$txtcaptcha' : ''
            }
            meta = {'handle_httpstatus_all': True}
            print(data)
            yield scrapy.FormRequest(url=self.start_urls[0], method='POST', headers=self.headers, formdata=data, meta=meta, callback=self.parse_assembly)
            break

    def parse_assembly(self, response):
        print(response.text)

Now, when I make the POST request, requests gives me HTTP error "404" and Scrapy gives me "400", and I cannot figure out why.

Any help would be appreciated. Thanks.

Note: for this task I cannot use any browser-automation tool such as Selenium.

