How to use Scrapy when the internet connection goes through a proxy with authentication

Published 2024-04-19 06:46:44


My internet connection goes through a proxy that requires authentication. When I try to use the Scrapy library for even the simplest example, such as:

scrapy shell http://stackoverflow.com

everything works fine until you request something with an XPath selector, at which point the response is:

>>> hxs.select('//title')
[<HtmlXPathSelector xpath='//title' data=u'<title>ERROR: Cache Access Denied</title'>]

Or, if I try to run any of the spiders created in the project, I get the following error:

C:\Users\Victor\Desktop\test\test>scrapy crawl test
2012-08-11 17:38:02-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: test)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-08-11 17:38:02-0400 [test] INFO: Spider opened
2012-08-11 17:38:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2012-08-11 17:38:47-0400 [test] DEBUG: Retrying <GET http://automation.whatismyip.com/n09230945.asp> (failed 1 times): TCP connection timed out: 10060: Se produjo un error durante el intento de conexión ya que la parte conectada no respondió adecuadamente tras un periodo de tiempo, o bien se produjo un error en la conexión establecida ya que el host conectado no ha podido responder..
2012-08-11 17:39:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
2012-08-11 17:39:29-0400 [test] INFO: Closing spider (finished)
2012-08-11 17:39:29-0400 [test] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
 'downloader/request_bytes': 732,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 8, 11, 21, 39, 29, 908000),
 'log_count/DEBUG': 9,
 'log_count/ERROR': 1,
 'log_count/INFO': 5,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2012, 8, 11, 21, 38, 2, 876000)}
2012-08-11 17:39:29-0400 [test] INFO: Spider closed (finished)

It looks like the problem is my proxy. If anyone knows how to use Scrapy with an authenticating proxy, please tell me.
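
A quick check outside Scrapy (a sketch only; USERNAME, PASSWORD and YOUR_PROXY_IP:PORT are placeholders, not real values) would confirm whether authenticated requests get through the proxy at all:

import urllib.request

# Placeholder proxy address and credentials -- replace with real values.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT',
})
opener = urllib.request.build_opener(proxy_handler)

# 200 here means the proxy accepts the credentials, so the problem
# would be in the Scrapy setup rather than in the proxy itself.
print(opener.open('http://stackoverflow.com').getcode())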


2 answers

Reposting Mahmoud M. Abdel Fattah's answer, since that page is no longer available. I did, however, make some small modifications to it.

If middlewares.py already exists, add the following code to it.

import base64

class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; b64encode returns
        # bytes, so decode them before building the header value
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
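
(That is one of the small modifications mentioned above: the original answer used base64.encodestring, which appends a trailing newline that corrupts the header value, and which was removed from Python in 3.9; base64.b64encode has neither problem.)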

In the settings.py file, add the following code:

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,
}
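
The priority 100 registers ProxyMiddleware ahead of Scrapy's built-in HttpProxyMiddleware (priority 750 in DOWNLOADER_MIDDLEWARES_BASE), so the proxy address and the Proxy-Authorization header are already in place by the time the built-in middleware runs. As far as I know, recent Scrapy releases can also parse credentials embedded directly in the proxy URL, i.e. request.meta['proxy'] = "http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT", which would make the manual header handling unnecessary; verify this against the version you are running.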

This should work by setting http_proxy. In my case, however, I was trying to access a URL over HTTPS and needed to set https_proxy, which I am still investigating. Any lead on that would be a big help.

Scrapy supports proxies through HttpProxyMiddleware:

This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects. Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

  • http_proxy
  • https_proxy
  • no_proxy
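
So instead of a custom middleware, the proxy (credentials included) can be exported through these variables before the crawler starts. A minimal sketch with placeholder values; the variables must be set before Scrapy starts, because HttpProxyMiddleware reads the environment when it is initialized, not per request:

import os

# Placeholder proxy address and credentials -- replace with real values.
# Run this before Scrapy initializes its middlewares.
os.environ['http_proxy'] = 'http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT'
os.environ['https_proxy'] = 'http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT'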

