Scrapy: recursive crawl produces DEBUG: Crawled (200) and no items output

Posted 2024-03-28 19:45:29


I'm trying to get my first recursive Scrapy spider running against a very simple site, but I only get DEBUG: Crawled (200) lines and nothing in the JSON output file.

I adapted an example I found online, and I honestly can't tell where the problem is. Can anyone help?

Spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class rgfMedlem(CrawlSpider):
    name = "rgfMedlem"
    allowed_domains = ["rgf.no"]
    start_urls = ["http://rgf.no/medlem/index.php"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('index.php', ))),

        Rule(SgmlLinkExtractor(allow=('\?s=', )), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//span[@class="innhold"]/table/tr')
        items = []
        item = SasItem()

        for row in rows:
            print "har ar jag"
            item['agent'] = row.select('td/b/text()').extract()
            item['org'] = row.select('td/b/text()').extract()
            item['link'] = rows.select('td/a/@href').extract()
            item['produkt'] = rows.select('td/b/text()').extract()
            items.append(item)

        return items

Spider crawl log:

(log output not preserved)

1 answer

#1 · Posted 2024-03-28 19:45:29

So basically your regex wasn't quite right, and your XPath needed some adjustment too. I think the code below does what you're after, so give it a try and let me know if you need more help.

Note: this code was written and tested against Scrapy 0.22.2.
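The link-extraction pattern can be sanity-checked in isolation with the standard re module. The URLs below are made up to match the site's index.php?s=…&q=… link style, and the pattern here writes .* explicitly, since a bare =* only matches repeated = characters rather than an arbitrary query value:

```python
import re

# The allow pattern with an explicit .* for the query value.
pattern = re.compile(r'(\?q=.*)|(\?s=\d+&q=.*)')

# Hypothetical URLs in the style of the target site's search links.
print(bool(pattern.search('http://rgf.no/medlem/index.php?q=oslo')))       # True
print(bool(pattern.search('http://rgf.no/medlem/index.php?s=20&q=oslo')))  # True
print(bool(pattern.search('http://rgf.no/medlem/index.php')))              # False
```

Because SgmlLinkExtractor's `allow` argument does a regex *search* rather than a full match, the pattern only needs to hit the query-string portion of each URL.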

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

# SasItem is assumed to be defined in your project's items.py
# with the fields agent, org, link and produkt.

class rgfMedlemSpider(CrawlSpider):
    name = "rgfMedlem"
    allowed_domains = ["rgf.no"]
    start_urls = ["http://rgf.no/medlem/index.php"]

    rules = (
        # Follow both the plain search links (?q=...) and the
        # paginated ones (?s=<offset>&q=...).
        Rule(SgmlLinkExtractor(allow=(r'\?q=.*', r'\?s=\d+&q=.*')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        rows = sel.xpath('//span[@class="innhold"]/table/tr')
        items = []

        # rows[0] is the table header, so skip it.
        for row in rows[1:]:
            item = SasItem()
            item['agent'] = row.xpath('./td[1]/a/text()|./td[1]/text()').extract()
            item['org'] = row.xpath('./td[2]/text()').extract()
            item['link'] = row.xpath('./td[1]/a/@href').extract()
            item['produkt'] = row.xpath('./td[3]/text()').extract()
            items.append(item)
        return items
