Captchas in Scrapy

6 votes

2 answers

8488 views

Asked 2025-04-16 21:14

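I'm working on a Scrapy app where I'm trying to log in to a site through a form that uses a captcha (this is not spam). I'm using ImagesPipeline to download the captcha, and I print it to the screen for the user to solve. So far so good.

My question is: how can I restart the spider so that it submits the captcha together with the form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. That captcha is then processed and downloaded by the ImagesPipeline and displayed to the user. I'm unclear on how to resume the spider's progress and pass the solved captcha and the same session back to it, because I believe the spider has to return the Item (i.e. quit) before the ImagesPipeline goes to work.

I've looked through the docs and examples, but I haven't found anything that makes it clear how to do this.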

Tags: web-scraping, user-interaction, session-management, scrapy, captcha, form-submission, crawler

2 Answers

I wouldn't create an Item and go through the ImagesPipeline for this.

import os
import subprocess
import urllib.request

from scrapy import FormRequest, Request

...

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    # takes the first <img> on the page; narrow the XPath to target the captcha
    img_src = response.xpath("//img/@src").get()

    # delete the previous captcha file (if any) and use urllib to write the new one to disk
    captcha_path = r"c:\captcha.jpg"
    if os.path.exists(captcha_path):
        os.remove(captcha_path)
    # urljoin() resolves a relative src against the page URL
    urllib.request.urlretrieve(response.urljoin(img_src), captcha_path)

    # I use a program here to show the jpg (actually send it somewhere)
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe", text=True).strip()

    # OR just get the input from the user from stdin
    # captcha = input("put captcha in manually>")

    # this performs the request and calls process_home_page with the
    # response (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(
        response,
        formnumber=0,
        formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha},
        callback=self.process_home_page,
    )]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    pass

...

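What I do here is use urllib's urlretrieve(url, filename) to store the picture, os.remove(file) to delete the previous picture, and subprocess.check_output to call an external command-line utility that solves the captcha. The Scrapy infrastructure as a whole is not used in this "hack", because solving a captcha like this is always a hack.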
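One caveat with fetching the image through urllib: that request does not carry the cookies Scrapy received, so a site that ties the captcha to the session may hand back a different image. Here is a minimal sketch of one way around it, assuming the session is identified by cookies set on the login page response. The helper names `cookie_header_from_set_cookie` and `download_with_cookies` are made up for illustration and are not part of Scrapy.

```python
import urllib.request
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build a Cookie request header from raw Set-Cookie header values.

    Accepts bytes or str, for example the list that
    response.headers.getlist("Set-Cookie") gives you in a Scrapy callback.
    """
    jar = SimpleCookie()
    for raw in set_cookie_values:
        if isinstance(raw, bytes):
            raw = raw.decode("latin-1")
        jar.load(raw)
    # Only name=value pairs are sent back to the server; attributes such as
    # Path or HttpOnly are dropped.
    return "; ".join(f"{name}={morsel.coded_value}" for name, morsel in jar.items())


def download_with_cookies(url, dest_path, cookie_header):
    """Fetch url with the given Cookie header and save the body to dest_path."""
    request = urllib.request.Request(url, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(request) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())
```

This only forwards cookies set on that one response, so it will miss anything set earlier in the crawl. A more Scrapy-native alternative is to request the image URL through Scrapy itself, so the cookies middleware handles the session for you.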

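The call to the external subprocess could have been done more nicely, but this works.

On some sites it's not possible to save the captcha image directly; you have to open the page in a browser, call a screen-capture utility, and crop at an exact location to "cut out" the captcha. Now that is screen scraping.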
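For what a tidier subprocess call might look like, here is a small sketch using `subprocess.run` with a timeout and decoded output. The function name `solve_captcha_with_tool` and the 120-second default are assumptions for illustration; the real solver is whatever executable you already call.

```python
import subprocess
import sys


def solve_captcha_with_tool(command, timeout=120):
    """Run an external solver and return what it printed on stdout.

    command is an argument list (program path first, then its arguments).
    Raises CalledProcessError on a non-zero exit code and TimeoutExpired
    if the tool does not finish within timeout seconds.
    """
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for the real solver: a child Python process that prints a fixed answer.
    print(solve_captcha_with_tool([sys.executable, "-c", "print('abc123')"]))
```

Passing the command as a list avoids shell quoting issues, and the timeout keeps the crawl from hanging forever if the solver never returns.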

Answered 2025-04-16 by Python大师

This is how you might get it to work inside the spider.

self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()

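Once you get the response, pause the engine, display the image, read the user's input, and then resume the crawl by submitting a POST request for the login.

I'd be interested to know whether this approach works for your case.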
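The pause/prompt/unpause sequence can be wrapped so the engine is always resumed, even if reading the captcha fails. This is only a sketch: `engine_paused` and `ask_user_for_captcha` are hypothetical helper names, and the only thing assumed about `engine` is that it has `pause()` and `unpause()` methods, as `self.crawler.engine` does.

```python
from contextlib import contextmanager


@contextmanager
def engine_paused(engine):
    """Pause the engine on entry and always unpause it on exit."""
    engine.pause()
    try:
        yield
    finally:
        engine.unpause()


def ask_user_for_captcha(engine, image_path, prompt=input):
    """Pause the engine, ask the user to read the captcha, then resume.

    prompt is any callable that takes a message and returns the typed text;
    it defaults to the built-in input().
    """
    with engine_paused(engine):
        return prompt(f"Open {image_path} and type the captcha> ").strip()
```

In a spider callback this would be called as `ask_user_for_captcha(self.crawler.engine, path_to_image)`, with the returned text then passed in the `formdata` of the login `FormRequest`.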

Answered 2025-04-16 by Python大师
