Parsing a website with Scrapy, following the next page, and writing XML

Question

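My script runs fine when I comment out one line of code: return items.

Here is my code. I changed the links to http://example.com because that seems to be what other people do, possibly to avoid legal issues around scraping data.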

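import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# FoodItem is assumed to be defined in the project's items module.

class Vfood(CrawlSpider):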
        name = "example.com"
        allowed_domains = [ "example.com" ]
        start_urls = [
                "http://www.example.com/TV_Shows/Show/Episodes"
        ]

        rules = (
                Rule(SgmlLinkExtractor(allow=('example\.com', 'page='), restrict_xpaths='//div[@class="paginator"]/span[@id="next"]'), callback='parse'),
        )

        def parse(self, response):
                hxs = HtmlXPathSelector(response)
                items = []
                countries = hxs.select('//div[@class="index-content"]')
                tmpNextPage = hxs.select('//div[@class="paginator"]/span[@id="next"]/a/@href').extract()
                for country in countries:
                        item = FoodItem()
                        countryName = country.select('.//h3/text()').extract()
                        item['country'] = countryName
                        print "Country Name: ", countryName
                        shows = country.select('.//div[@class="content1"]')
                        for show in shows.select('.//div'):
                                showLink = (show.select('.//h4/a/@href').extract()).pop()
                                showLocation = show.select('.//h4/a/text()').extract()
                                showText = show.select('.//p/text()').extract()
                                item['showURL'] = "http://www.travelchannel.com"+str(showLink)
                                item['showcity'] = showLocation
                                item['showtext'] = showText
                                print "\t", showLink
                                print "\t", showLocation
                                print "\t", showText
                                print "\n"
                                items.append(item)
                        #return items

                for NextPageLink in tmpNextPage:
                        m = re.search("Location", NextPageLink)
                        if m:
                                NextPage = NextPageLink
                                print "Next Page:  ", NextPage
                                yield Request("http://www.example.com/"+NextPage, callback = self.parse)
                        else:
                                NextPage = 'None'
SPIDER = Vfood()

If I uncomment the #return items line, I get the following error:

yield Request("http://www.example.com/"+NextPage, callback = self.parse)
SyntaxError: 'return' with argument inside generator
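
For context on this error, here is a minimal sketch of the underlying behavior, not the spider itself. Any function that contains a yield is a generator, and in Python 2 a generator may not also contain a return with a value. Python 3.3 and later accept the syntax, but the returned value is carried by StopIteration rather than handed to whoever iterates the generator, so the items would still be lost. The function names and the string stand-ins for items and requests are assumptions made for illustration.

```python
def parse_with_return(rows, next_links):
    """Mirrors the spider with `return items` uncommented."""
    items = [row.upper() for row in rows]
    return items  # Python 2: SyntaxError. Python 3: value ends up in StopIteration.
    for link in next_links:  # never reached, but its yield makes this a generator
        yield "request:" + link


def parse_with_yield(rows, next_links):
    """Yields each item, then each follow-up request, from one generator."""
    for row in rows:
        yield row.upper()
    for link in next_links:
        yield "request:" + link  # stand-in for a scrapy Request


lost = list(parse_with_return(["a", "b"], ["/page=2"]))
kept = list(parse_with_yield(["a", "b"], ["/page=2"]))
print(lost)
print(kept)
```

Under Python 3 the first call produces an empty list, while the second produces both items followed by the request stand-in. This suggests that yielding each item, instead of returning the list, is the likely way out.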

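Because I left that line commented out, I cannot collect the data as XML, but judging by the printed output, I do see everything I expect on screen.

The command I use to get the XML is: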

scrapy crawl example.com --set FEED_URI=food.xml --set FEED_FORMAT=xml

When I uncomment the return items line, I can create the XML file, but the script stops and no longer follows the links.
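
One way to get both behaviors is to let a single callback yield the items and the next-page request. Scrapy callbacks may produce a mix of items and Request objects, and the framework sends items to the feed exporter and schedules the requests. The sketch below imitates that dispatch without Scrapy so it can run anywhere. FAKE_SITE, FakeRequest, and crawl are hypothetical stand-ins, and a real callback would receive a response object rather than a URL.

```python
from collections import deque

# Hypothetical two-page site: each page lists its items and an optional next link.
FAKE_SITE = {
    "/shows?page=1": {"items": ["Paris", "Rome"], "next": "/shows?page=2"},
    "/shows?page=2": {"items": ["Tokyo"], "next": None},
}


class FakeRequest(object):
    """Stand-in for scrapy's Request: a URL plus the callback to run on it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(url):
    """Yield every item on the page, then a request for the next page."""
    page = FAKE_SITE[url]
    for city in page["items"]:
        yield {"showcity": city}
    if page["next"]:
        yield FakeRequest(page["next"], callback=parse)


def crawl(start_url):
    """Tiny stand-in for the engine: collect items, queue requests."""
    collected = []
    queue = deque([FakeRequest(start_url, parse)])
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                collected.append(result)
    return collected


scraped = crawl("/shows?page=1")
print(scraped)
```

Here the items from both pages end up in scraped because nothing ends the callback before the next-page request is produced. Separately, the Scrapy documentation warns against using parse as the callback name in a CrawlSpider rule, since CrawlSpider uses parse internally, so renaming the callback may also be worth trying.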
