How can I capture multiple responses for a single POST request with Scrapy?

Posted 2024-05-28 22:46:40


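I am trying to scrape this website and download the PDF files you get once you have gone through the site's whole workflow. I am using Scrapy, and I am having trouble capturing the captcha at the right time.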

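The site is an ASPX page and uses "ViewStates" to track each POST request. If you browse the site, you will see that whenever you fill in any dropdown field, it sends a POST request carrying the "ViewState" value to a URL path, which you can see in the browser console. At the same time, it sends a separate GET request to another URL to fetch the "CAPTCHA" image. I cannot get hold of that second response, and I don't know whether Scrapy can capture multiple requests and multiple responses at the same time.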


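I then tried to find a workaround. I followed almost everything mentioned in this StackOverflow post, but the response I get is HTML containing a JavaScript alert saying that the wrong text was inserted and that I should enter the new characters shown in the image text box. So that solution did not work for me either.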

Here is my spider code:

# -*- coding: utf-8 -*-
import scrapy
import cv2
import pytesseract
from PIL import Image
from io import BytesIO
from election_data.items  import ElectionDataItem

class ElectionSpider(scrapy.Spider):
    name = 'election'
    allowed_domains = ['ceo.maharashtra.gov.in']
    start_urls = ['https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx']
    dist_dict = []

    def parse(self, response):
        district = response.css('select#Content_DistrictList > option::attr(value)')[1].extract()
        data = {
            '__EVENTTARGET' : response.css('select#Content_DistrictList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : district,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_assembly)
        request.meta['district'] = district
        yield request

    def parse_assembly(self, response):
        print('parse_assembly')
        assembly = response.css('select#Content_AssemblyList > option::attr(value)')[1].extract()
        data = {
            '__EVENTTARGET' : response.css('select#Content_AssemblyList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : assembly,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_part)
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = assembly
        yield request

    def parse_part(self, response):
        print('parse_part')
        part = response.css('select#Content_PartList > option::attr(value)')[1].extract()
        data = {
            '__EVENTTARGET' : response.css('select#Content_PartList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : response.meta['assembly'],
            'ctl00$Content$PartList' : part,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_captcha)
        request.meta['__VIEWSTATE'] = response.css('input#__VIEWSTATE::attr(value)').extract_first()
        request.meta['__EVENTVALIDATION'] = response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = part
        yield request

    def parse_captcha(self, response):
        data_for_later = response
        request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx', callback=self.store_image)
        request.meta['__VIEWSTATE'] = response.meta['__VIEWSTATE']
        request.meta['__EVENTVALIDATION'] = response.meta['__EVENTVALIDATION']
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = response.meta['part']
        request.meta['data_for_later'] = data_for_later
        yield request

    def store_image(self, response):
        captcha_target_filename = 'filename.png'
        # save the image for processing
        i = Image.open(BytesIO(response.body))
        i.save(captcha_target_filename)
        captcha_text = self.solve_captcha(captcha_target_filename)
        print(captcha_text)
        data = {
            '__EVENTTARGET' : '',
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.meta['__VIEWSTATE'],
            '__EVENTVALIDATION' : response.meta['__EVENTVALIDATION'],
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : response.meta['assembly'],
            'ctl00$Content$PartList' : response.meta['part'],
            'ctl00$Content$txtcaptcha' : captcha_text,
            'ctl00$Content$OpenButton': 'Open PDF'
        }
        captcha_form = response.meta['data_for_later']
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest.from_response(captcha_form, method='POST', formdata=data, meta=meta, callback=self.get_pdfs)
        yield request

    def get_pdfs(self, response):
        # THIS IS WHERE FINAL RESPONSE IS CAPTURED
        print(response.text)

    def solve_captcha(self, image):
        image = cv2.imread(image,0)
        thresh = cv2.threshold(image, 220, 255, cv2.THRESH_BINARY)[1]

        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
        close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

        result = 255 - close
        cv2.imshow('thresh', thresh)
        cv2.imshow('close', close)
        cv2.imshow('result', result)

        return pytesseract.image_to_string(result)

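If you browse the site above, fill in all the form details, and watch the "Network" tab of the browser console, you will see the problem.

Please guide me on how to solve this. Thank you very much.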


2 Answers

This is why I hate ASP.NET applications: they just drive you crazy. Anyway, everything you did is fine except for one thing:

def parse_captcha(self, response):
    data_for_later = response
    request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx', callback=self.store_image)
    request.meta['__VIEWSTATE'] = response.meta['__VIEWSTATE']
    request.meta['__EVENTVALIDATION'] = response.meta['__EVENTVALIDATION']
    request.meta['district'] = response.meta['district']
    request.meta['assembly'] = response.meta['assembly']
    request.meta['part'] = response.meta['part']
    request.meta['data_for_later'] = data_for_later
    yield request

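This callback receives the response that comes back after you set the part, but what you are doing is copying the __VIEWSTATE and __EVENTVALIDATION values from before the part was set. So you need to make sure you capture the correct state, that is, the values contained in the response this callback receives.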
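As a rough illustration of "capture the correct state", here is a minimal standard-library sketch that reads the two hidden fields from whichever page is passed in. The class `HiddenFieldParser`, the function `current_state`, and the two sample pages are made up for this example; they are not part of the spider or of the real site's markup.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def current_state(html):
    """Return the ASP.NET state fields found in this page, not an earlier one."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return {
        key: parser.fields.get(key, "")
        for key in ("__VIEWSTATE", "__EVENTVALIDATION")
    }


# Made-up pages standing in for the responses before and after the part is set.
before_part = (
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="old-vs" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="old-ev" />'
)
after_part = (
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="new-vs" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="new-ev" />'
)

# The final form submission should carry the state of the latest page.
state = current_state(after_part)
print(state)
```

In the spider itself, the equivalent change would presumably be to fill `request.meta['__VIEWSTATE']` and `request.meta['__EVENTVALIDATION']` inside `parse_captcha` from `response.css('input#__VIEWSTATE::attr(value)').extract_first()` and `response.css('input#__EVENTVALIDATION::attr(value)').extract_first()`, the same selectors the question already uses, instead of copying them from `response.meta`.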


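Not an answer yet, but a few suggestions:

  1. Do you have cookies enabled? An ASP.NET_SessionId cookie is passed with every request on this site.

  2. Does the response you get for the captcha request look OK?

  3. This long chain of requests is hard to follow and may contain bugs that are hard to spot. As a first step, I suggest concentrating on solving the captcha:

    • If I select only the district and fill in a wrong or a correct captcha solution, I get either a "wrong captcha" message or a "select proper details" message.
    • So getting to "select proper details" is easier (fewer requests and moving parts), yet it already shows whether you solved the captcha correctly. I suggest trying that first and then building on the result.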
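The first two suggestions can be checked with very little code. The sketch below is only an illustration: `COOKIES_ENABLED` and `COOKIES_DEBUG` are standard Scrapy settings, while the names `COOKIE_CHECK_SETTINGS`, `IMAGE_SIGNATURES`, and `detect_image_type` are invented here, and the list of image formats is an assumption about what the captcha endpoint might return.

```python
# Hypothetical settings dict for illustration; the two keys are real Scrapy settings.
COOKIE_CHECK_SETTINGS = {
    "COOKIES_ENABLED": True,  # keep the session cookie between requests
    "COOKIES_DEBUG": True,    # log the cookies sent and received
}

# Leading bytes ("magic numbers") of a few common image formats.
IMAGE_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}


def detect_image_type(body):
    """Return the format name if body starts with a known image signature, else None."""
    for name, signature in IMAGE_SIGNATURES.items():
        if body.startswith(signature):
            return name
    return None


# A PNG-like body is recognised; an HTML error page is not.
print(detect_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(detect_image_type(b"<html><body>error page</body></html>"))
```

In the spider, the dict could be assigned to the `custom_settings` class attribute, and `detect_image_type(response.body)` could be called at the top of `store_image` so that an HTML error page is noticed before `Image.open` is attempted.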

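Other than that, your approach looks fine, with no obvious problems.

By the way: it may turn out in the end that simulating the complete request sequence is unnecessary, and that you can skip straight to the last two requests, which fetch the final captcha and send the final form submission. That does not help us right now, though; it only matters for refactoring and simplifying the code later.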
