Using Tor with the Scrapy framework

Posted 2024-05-17 14:49:55


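I am trying to crawl a website that is sophisticated enough to block bots: it allows only a few requests, and after that Scrapy hangs.

Question 1: If Scrapy hangs, is there a way to restart my crawl from the same point? To work around the problem, I wrote my settings file like this: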

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'

This is my spider:

class ypSpider(CrawlSpider):

    name = "yp"

    start_urls = [
        # SOME URL
    ]
    rules = (
        # These are some rules
    )

    def parse_item(self, response):
        # Clean the HTML page by removing scripts and HTML tags
        hxs = HtmlXPathSelector(response)

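My question is where I should configure the HTTP proxy, and whether I need to import any Tor-related classes. I am new to Scrapy. I have learned a lot from this group, and now I am trying to learn how to use IP rotation or Tor.

As one of the members suggested, I started Tor and set HTTP_PROXY to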

set http_proxy=http://localhost:8118

But it produced an error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to

set http_proxy=http://localhost:9051

Now the error is

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

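I checked Firefox's network settings and could not see any HTTP proxy there; instead it is using SOCKS v5, showing 127.0.0.1:9051. (Before Tor, it worked with no proxy.) Please help me, I still do not understand how to use Tor through Scrapy. Which Tor bundle should I use, and how? I hope both of my questions can be solved:

If the crawler hangs for some reason (for example, a connection failure), I want to resume the crawl from where it stopped.
How can I use rotating IPs in Scrapy?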


1 answer

Answer 1 · Posted 2024-05-17 14:49:55

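Tor itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that you are not running Privoxy [1] correctly. Set up Privoxy properly, then try again with the environment variable http_proxy=http://localhost:8118.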
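A "connection refused" error usually means nothing is listening on the port. The following is a minimal sketch for checking that before starting the crawl. The helper name `port_is_open` is hypothetical. 8118 is Privoxy's default listening port, and 9050 and 9051 are Tor's usual SOCKS and control ports; your local configuration may differ.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port numbers are the usual defaults, not values confirmed for this setup.
    for label, port in (("Privoxy (HTTP proxy)", 8118),
                        ("Tor SOCKS", 9050),
                        ("Tor control", 9051)):
        state = "listening" if port_is_open("127.0.0.1", port) else "closed"
        print("%s on port %d: %s" % (label, port, state))
```

If 8118 is reported as closed, Privoxy is not running (or listens elsewhere), which would match the first error in the question. An open port only shows that something accepts connections there, not that it speaks HTTP.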

I have successfully crawled through Tor using Privoxy and Scrapy.
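As an alternative to the environment variable, the proxy can be set per request inside Scrapy, which reads it from `request.meta["proxy"]`. This is only a sketch: the class name `TorProxyMiddleware`, the module path `yp.middlewares`, and the priority value are assumptions for illustration, and it presumes Privoxy is listening on 127.0.0.1:8118 and forwarding to Tor.

```python
class TorProxyMiddleware:
    """Hypothetical downloader middleware that sends every request via Privoxy."""

    # Assumed Privoxy address; adjust to the local setup.
    PROXY_URL = "http://127.0.0.1:8118"

    def process_request(self, request, spider):
        # Scrapy's downloader uses the proxy named in request.meta["proxy"].
        request.meta["proxy"] = self.PROXY_URL
        # Returning None lets Scrapy continue processing the request normally.
        return None


# In settings.py (module path and priority are illustrative):
# DOWNLOADER_MIDDLEWARES = {
#     "yp.middlewares.TorProxyMiddleware": 100,
# }
```

No Tor-specific import is needed for this approach, since Scrapy only talks HTTP to Privoxy.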

[1] http://www.privoxy.org/
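On the first question (resuming a crawl), one option is Scrapy's job persistence, assuming a Scrapy version that supports the `JOBDIR` setting. The fragment below is a sketch of settings to add. The directory name and the retry and timeout values are illustrative assumptions, not tuned recommendations.

```python
# settings.py fragment (sketch)

# Directory where Scrapy stores the pending request queue and seen-request
# fingerprints, so a later run with the same JOBDIR continues the crawl.
JOBDIR = "crawls/yp-1"  # hypothetical directory name

# Retry failed requests a few times instead of dropping them.
RETRY_ENABLED = True
RETRY_TIMES = 5  # illustrative value

# Give up on a request that hangs for longer than this many seconds.
DOWNLOAD_TIMEOUT = 30  # illustrative value
```

The same directory can also be passed on the command line, for example `scrapy crawl yp -s JOBDIR=crawls/yp-1`. Persistence is most reliable when the spider is shut down cleanly (a single Ctrl-C) rather than killed.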
