Crawling SSL sites with Scrapy

User

Reply #1 · edited 2024-05-29 03:18:23

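The PR fixing this problem has been merged into Scrapy. More recently (February 2016), another pull request fixed a similar bug.

With the latest Scrapy I can fetch your page, but the problem still occurs with older versions.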

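In general, if you run into HTTPS problems with Scrapy, the solution is:

Upgrade Scrapy to the latest version.
Check which Twisted version you are using and, if it is not the latest, upgrade it (as of this writing, versions 14 and above are confirmed to be noticeably better at SSL).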
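Before upgrading, it can help to see which versions are actually installed. The following is a minimal sketch, assuming Python 3.8+ (for importlib.metadata). The helper names installed_versions and major_version are made up for illustration, and pyOpenSSL is listed only on the assumption that it matters for SSL as well.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("Scrapy", "Twisted", "pyOpenSSL")):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def major_version(version_string):
    """Leading integer of a version such as '14.0.2'; None if unparsable."""
    head = version_string.split(".", 1)[0]
    return int(head) if head.isdigit() else None


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, ver or "not installed")
```

If major_version reports less than 14 for Twisted, upgrading Twisted is the first thing to try.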

If you still have problems after updating Scrapy and Twisted, you may need to subclass ScrapyClientContextFactory; see the answer below for details.

For more details, see this GitHub issue

User

Reply #2 · edited 2024-05-29 03:18:23

1. Add DOWNLOADER_CLIENTCONTEXTFACTORY = 'testproject.CustomContext.CustomClientContextFactory' to your settings.py

2. Create a file named CustomContext.py in the project directory and add the following code

from OpenSSL import SSL
from twisted.internet.ssl import ClientContextFactory
# Private Twisted module; this import can fail on older Twisted releases
from twisted.internet._sslverify import ClientTLSOptions
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory


class CustomClientContextFactory(ScrapyClientContextFactory):

    def getContext(self, hostname=None, port=None):
        ctx = ClientContextFactory.getContext(self)
        # Enable all workarounds to SSL bugs as documented by
        # http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html
        ctx.set_options(SSL.OP_ALL)
        if hostname:
            # Attach per-host TLS options (SNI) to the context
            ClientTLSOptions(hostname, ctx)
        return ctx

Note: this works well for crawling HTTPS sites on Windows, but when I tried the same approach on Ubuntu 14.04 it threw the following error:

from twisted.internet._sslverify import ClientTLSOptions
exceptions.ImportError: cannot import name ClientTLSOptions

It would be great if someone could add a solution for the above error.

Edit:

Instead of using from twisted.internet._sslverify import ClientTLSOptions

I changed it to the following:

try:
    # available since twisted 14.0
    from twisted.internet._sslverify import ClientTLSOptions
except ImportError:
    ClientTLSOptions = None
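One caveat with this fallback: ClientTLSOptions can now be None, so calling it unconditionally in getContext would raise a TypeError on the older Twisted. A minimal sketch of a guard follows; apply_sni is a hypothetical helper name of my own, not part of Scrapy or Twisted.

```python
try:
    # Available since Twisted 14.0 (private module, so it may move).
    from twisted.internet._sslverify import ClientTLSOptions
except ImportError:
    ClientTLSOptions = None


def apply_sni(hostname, ctx, tls_options=ClientTLSOptions):
    """Attach per-host TLS options to ctx only when that is possible.

    Returns True if tls_options was applied, False if it was skipped
    (no hostname given, or ClientTLSOptions could not be imported).
    """
    if hostname and tls_options is not None:
        tls_options(hostname, ctx)
        return True
    return False
```

In getContext, the two lines starting at "if hostname:" would then become a single call to apply_sni(hostname, ctx).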

User

Reply #3 · edited 2024-05-29 03:18:23

For anyone getting "TypeError: unbound method getContext() must be called with ClientContextFactory instance as first argument ...":

Replace ctx = ClientContextFactory.getContext(self)

with ctx = ScrapyClientContextFactory.getContext(self)
