How to use Scrapy when the internet connection goes through a proxy with authentication

Published 2024-04-19 06:46:44


My internet connection goes through a proxy that requires authentication. When I try to use the Scrapy library for even the simplest example, such as:

scrapy shell http://stackoverflow.com

everything works fine until you request something with an XPath selector, at which point the response is:

>>> hxs.select('//title')
[<HtmlXPathSelector xpath='//title' data=u'<title>ERROR: Cache Access Denied</title'>]

Or, if I try to run any of the spiders created in the project, I get the following error:

C:\Users\Victor\Desktop\test\test>scrapy crawl test
2012-08-11 17:38:02-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: test)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-08-11 17:38:02-0400 [test] INFO: Spider opened
2012-08-11 17:38:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2012-08-11 17:38:47-0400 [test] DEBUG: Retrying <GET http://automation.whatismyip.com/n09230945.asp> (failed 1 times): TCP connection timed out: 10060: Se produjo un error durante el intento de conexión ya que la parte conectada no respondió adecuadamente tras un periodo de tiempo, o bien se produjo un error en la conexión establecida ya que el host conectado no ha podido responder..
2012-08-11 17:39:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
2012-08-11 17:39:29-0400 [test] INFO: Closing spider (finished)
2012-08-11 17:39:29-0400 [test] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
 'downloader/request_bytes': 732,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2012, 8, 11, 21, 39, 29, 908000),
 'log_count/DEBUG': 9,
 'log_count/ERROR': 1,
 'log_count/INFO': 5,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2012, 8, 11, 21, 38, 2, 876000)}
2012-08-11 17:39:29-0400 [test] INFO: Spider closed (finished)

It looks like the problem is my proxy. If anyone knows how to use Scrapy with an authenticating proxy, please tell me.
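
A quick check outside Scrapy (a sketch only; USERNAME, PASSWORD and YOUR_PROXY_IP:PORT are placeholders, not real values) would confirm whether authenticated requests get through the proxy at all:

import urllib.request

# Placeholder proxy address and credentials -- replace with real values.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT',
})
opener = urllib.request.build_opener(proxy_handler)

# 200 here means the proxy accepts the credentials, so the problem
# would be in the Scrapy setup rather than in the proxy itself.
print(opener.open('http://stackoverflow.com').getcode())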


2 answers

Reposting Mahmoud M. Abdel Fattah's answer, since that page is no longer available. I did, however, make some small modifications to it.

If middlewares.py already exists, add the following code to it.

import base64

class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; b64encode returns
        # bytes, so decode them before building the header value
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
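
(That is one of the small modifications mentioned above: the original answer used base64.encodestring, which appends a trailing newline that corrupts the header value, and which was removed from Python in 3.9; base64.b64encode has neither problem.)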

In the settings.py file, add the following code:

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,
}
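
The priority 100 registers ProxyMiddleware ahead of Scrapy's built-in HttpProxyMiddleware (priority 750 in DOWNLOADER_MIDDLEWARES_BASE), so the proxy address and the Proxy-Authorization header are already in place by the time the built-in middleware runs. As far as I know, recent Scrapy releases can also parse credentials embedded directly in the proxy URL, i.e. request.meta['proxy'] = "http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT", which would make the manual header handling unnecessary; verify this against the version you are running.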

This should work by setting http_proxy. In my case, however, I was trying to access a URL over HTTPS and needed to set https_proxy, which I am still investigating. Any lead on that would be a big help.

Scrapy supports proxies through HttpProxyMiddleware:

This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects. Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

  • http_proxy
  • https_proxy
  • no_proxy
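
So instead of a custom middleware, the proxy (credentials included) can be exported through these variables before the crawler starts. A minimal sketch with placeholder values; the variables must be set before Scrapy starts, because HttpProxyMiddleware reads the environment when it is initialized, not per request:

import os

# Placeholder proxy address and credentials -- replace with real values.
# Run this before Scrapy initializes its middlewares.
os.environ['http_proxy'] = 'http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT'
os.environ['https_proxy'] = 'http://USERNAME:PASSWORD@YOUR_PROXY_IP:PORT'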

