Scrapy spider stops working


Background: I am running Scrapy 0.16.2 on Python 2.7.2+ under Linux Mint. A few days ago I had this problem and, with some help, I managed to get past it. For a while the spider crawled normally:

2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled item pipelines: 
2013-11-23 01:02:51+0200 [basketsp17] INFO: Spider opened
2013-11-23 01:02:51+0200 [basketsp17] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Redirecting (301) to <GET http://www.euroleague.net/main/results/by-date> from <GET http://www.euroleague.net/main/results/by-date/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.euroleaguebasketball.net': <GET http://www.euroleaguebasketball.net/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.eurocupbasketball.com': <GET http://www.eurocupbasketball.com/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.euroleague.tv': <GET http://www.euroleague.tv/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.euroleaguestore.net': <GET http://www.euroleaguestore.net/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'fantasychallenge.euroleague.net': <GET http://fantasychallenge.euroleague.net/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/TheEuroleague>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.youtube.com': <GET http://www.youtube.com/euroleague>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'euroleaguedevotion.ourtoolbar.com': <GET http://euroleaguedevotion.ourtoolbar.com/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'euroleague.synapticdigital.com': <GET http://euroleague.synapticdigital.com/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'twitter.com': <GET http://twitter.com/Euroleague>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'kort.es': <GET http://kort.es/ulpGt>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'adserver.itsfogo.com': <GET http://adserver.itsfogo.com/click.aspx?zoneid=136145>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/> (referer: http://www.euroleague.net/main/results/by-date)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/devotion/home> (referer: http://www.euroleague.net/main/results/by-date)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/euroleaguenews/transactions/2013-14-signings> (referer: http://www.euroleague.net/main/results/by-date)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/features/blog/2013-2014> (referer: http://www.euroleague.net/main/results/by-date)

But after a few runs it stops crawling, and I would like to know where the problem is. If I try the code again the next day, it works for a few minutes and then stops. Well, it runs, but it does not crawl. If I change the start URL, it starts working again and later stops in exactly the same way, with the same code. What is going wrong here?

This is what I see after it stops:

^{pr2}$

Here is the code I am using:

from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.http import TextResponse 
from scrapy.http import HtmlResponse


class Basketspider(CrawlSpider):
    name = "basketsp17"
    allowed_domains = ["www.euroleague.net"]
    start_urls = ["http://www.euroleague.net/main/results/by-date/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('main\/results\/showgame\?gamecode\=/\d$\&seasoncode\=E2013\#!boxscore')),follow=True),
        Rule(SgmlLinkExtractor(allow=()),callback='parse_item'),
    )

    def init_request(self):
        return HtmlResponse("http://www.euroleague.net/main/results/by-date/", body = body)

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        items=[]
        item = BasketbaseItem()
        item['date'] = sel.select('//div[@class="gs-dates"]/text()').extract() # Game date
        item['time'] = sel.select('//div[@class="gs-dates"]/span[@class="GameScoreTimeContainer"]/text()').extract() # Game time

        items.append(item) 
        return items 
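
In case it helps, one quick check is to see what kind of response Scrapy actually builds for the start URL, using the interactive shell that ships with Scrapy (the exact class name printed may vary by version):

$ scrapy shell "http://www.euroleague.net/main/results/by-date/"
>>> type(response)
# HtmlResponse means the link extractors have something to work with;
# a plain scrapy.http.Response would fit a spider that runs but never crawls.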

1 Answer

I modified your code to get it working. The changes:

I cannot see the purpose of init_request; at least, I don't think anything ever calls it.

Override CrawlSpider's parse and convert the response to an HtmlResponse before passing it on to the base parse.

Convert the response to an HtmlResponse again in parse_item.

Understand that we are blindly casting the response to an HtmlResponse here. At a minimum you should check that the response is of type Response and, if possible, look for an html tag in the body before converting it (these are checks Scrapy itself performs, but they failed for this site). Also, this conversion could be handled more cleanly in a downloader middleware: if you convert the response in its process_response method, it happens before the spider's callbacks run (see the sketch after the modified code below).

#from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.http import TextResponse
from scrapy.http import HtmlResponse


class Basketspider(CrawlSpider):
    name = "basketsp17"
    allowed_domains = ["www.euroleague.net"]
    start_urls = ["http://www.euroleague.net/main/results/by-date/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('main\/results\/showgame\?gamecode\=/\d$\&seasoncode\=E2013\#!boxscore')),follow=True),
        Rule(SgmlLinkExtractor(allow=()),callback='parse_item'),
    )  

    def init_request(self):
        # Left as in the question: nothing appears to call this method,
        # and `body` is undefined here.
        print 'init request is called'
        return HtmlResponse("http://www.euroleague.net/main/results/by-date/", body=body)

    def parse(self, response):
        # Rebuild the response as an HtmlResponse so that CrawlSpider's
        # link-extraction rules have an HTML body to work with.
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        return super(Basketspider, self).parse(response)

    def parse_item(self, response):
        # Same conversion again, in case a non-HTML response reaches this callback.
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        sel = HtmlXPathSelector(response)
        items=[]
        print 'parse item is called'
        #item = BasketbaseItem()
        #item['date'] = sel.select('//div[@class="gs-dates"]/text()').extract() # Game date
        #item['time'] = sel.select('//div[@class="gs-dates"]/span[@class="GameScoreTimeContainer"]/text()').extract() # Game time

        #items.append(item) 
        return items
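
For completeness, here is a minimal sketch of the downloader-middleware variant mentioned above. The class name and module path are made up; wire it in through DOWNLOADER_MIDDLEWARES in the project settings:

# A minimal sketch (class name and module are hypothetical): upgrade plain
# Response objects that appear to carry HTML to HtmlResponse before they
# reach the spider, so CrawlSpider's link-extraction rules can run.
from scrapy.http import Response, HtmlResponse


class ForceHtmlResponseMiddleware(object):

    def process_response(self, request, response, spider):
        # HtmlResponse (and other TextResponse subclasses) pass through;
        # only convert the bare Response class, and only when the body
        # actually looks like HTML.
        if type(response) is Response and '<html' in response.body[:512].lower():
            return HtmlResponse(url=response.url, status=response.status,
                                headers=response.headers, body=response.body)
        return response

Then enable it with something like this (the module path matches your project name here, and the priority number is arbitrary):

DOWNLOADER_MIDDLEWARES = {
    'basketbase.middlewares.ForceHtmlResponseMiddleware': 590,
}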

I think your problem is a combination of the website not following standards and Scrapy not using the body to construct the response. This seems worth raising as a question or an issue with the Scrapy project.
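
As background on why a site like this can end up with a plain Response: Scrapy picks the response class from the Content-Type header (with URL and body sniffing as fallbacks), so a missing or bogus header yields the bare Response class. You can probe that mapping directly with the responsetypes helper (a rough illustration, as I understand the helper shipped in this Scrapy generation):

from scrapy.responsetypes import responsetypes

# text/html maps to HtmlResponse; an unknown binary type falls back to the
# plain Response class, which offers no text for link extractors to scan.
print responsetypes.from_content_type('text/html')
print responsetypes.from_content_type('application/octet-stream')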
