Scrapy spider never reaches the parse function

Posted 2024-06-08 14:03:24


I have a spider, shown below, but it does not seem to reach the parse function. Could someone take a look and let me know if I am missing something? Have I implemented the SgmlLinkExtractor correctly?

The spider should pick out all of the links in the left sidebar, create a request for each, and then parse the next page for a Facebook link. It should also do this for the other pages matched by the SgmlLinkExtractor. At the moment the spider runs, but it does not parse any pages.

class PrinzSpider(CrawlSpider):
    name = "prinz"
    allowed_domains = ["prinzwilly.de"]
    start_urls = ["http://www.prinzwilly.de/"]

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(r'veranstaltungen-(.*)', ),
            ),
            callback='parse'
            ),
        )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        startlinks = hxs.select("//ul[@id='mainNav2']/li/a")
        print startlinks
        for link in startlinks:
            giglink = link.select('@href').extract()
            item = GigItem()
            item['gig_link'] = giglink
            request = Request(item['gig_link'], callback='parse_gig_page')
            item.meta['item'] = item
            yield request

    def parse_gig_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        gig_content = hxs.select("//div[@class='n']/table/tbody").extract()
        fb_link = re.findall(r'(?:www.facebook.com/)(.*)', gig_content)
        print '********** FB LINK ********', fb_link
        return item

Edit:

settings.py

BOT_NAME = 'gigscraper'

SPIDER_MODULES = ['gigscraper.spiders']
NEWSPIDER_MODULE = 'gigscraper.spiders'

ITEM_PIPLINES = ['gigscraper.pipelines.GigscraperPipeline']

items.py

from scrapy.item import Item, Field

class GigItem(Item):
    gig_link = Field()

pipelines.py

class GigscraperPipeline(object):
    def process_item(self, item, spider):
        print 'here I am in the pipeline'
        return item

1 Answer

#1 · Forum user · posted 2024-06-08 14:03:24

Two problems:

  • extract() returns a list; you are missing the [0]
  • a request's callback should not be a string; use self.parse_gig_page instead
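Neither fix needs Scrapy to demonstrate. Below is a minimal stdlib-only sketch, with invented sample HTML and a stand-in FakeRequest class (not Scrapy's real Request), showing both failure modes: XPath-style queries return a list even for a single match, and a plain string is not callable where the crawl engine expects a callback.

```python
import xml.etree.ElementTree as ET

# Point 1: XPath-style queries return a *list*, even for one match.
# Scrapy's extract() behaves the same way, hence the missing [0]:
# item['gig_link'] was being set to a whole list, not a URL string.
html = """
<ul id="mainNav2">
  <li><a href="http://www.prinzwilly.de/veranstaltungen-1">Gig 1</a></li>
  <li><a href="http://www.prinzwilly.de/veranstaltungen-2">Gig 2</a></li>
</ul>
"""
links = ET.fromstring(html).findall(".//a")   # always a list
hrefs = [a.get("href") for a in links]
print(hrefs[0])  # take the first element, as extract()[0] does

# Point 2: the callback must be callable. FakeRequest is a stand-in
# for Scrapy's Request; the engine eventually *calls* the callback,
# which a plain string cannot satisfy.
class FakeRequest:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def parse_gig_page(response):
    return response

bad = FakeRequest("http://www.prinzwilly.de/", callback="parse_gig_page")
good = FakeRequest("http://www.prinzwilly.de/", callback=parse_gig_page)
print(callable(bad.callback), callable(good.callback))  # False True
```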

Here is the modified code (working):

import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class GigItem(Item):
    gig_link = Field()


class PrinzSpider(CrawlSpider):
    name = "prinz"
    allowed_domains = ["prinzwilly.de"]
    start_urls = ["http://www.prinzwilly.de/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'veranstaltungen-(.*)',)), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        startlinks = hxs.select("//ul[@id='mainNav2']/li/a")
        for link in startlinks:
            item = GigItem()
            item['gig_link'] = link.select('@href').extract()[0]
            yield Request(item['gig_link'], callback=self.parse_gig_page, meta={'item': item})

    def parse_gig_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        gig_content = hxs.select("//div[@class='n']/table/tbody").extract()[0]
        fb_link = re.findall(r'(?:www.facebook.com/)(.*)', gig_content)
        print '********** FB LINK ********', fb_link
        return item
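As a side note, the regex that pulls out the Facebook link can be exercised on its own. The sketch below runs it against a small invented snippet of markup (not the real prinzwilly.de page): the non-capturing group anchors on the literal prefix, and findall returns only what the capture group matched. Note the unescaped dots in the pattern also match any character, so it is looser than it looks.

```python
import re

# Invented sample markup standing in for the extracted table HTML.
gig_content = '<a href="http://www.facebook.com/events/12345">facebook</a>'

# Same pattern as in parse_gig_page above: the (?:...) group matches the
# prefix without capturing it; findall returns the (.*) capture, i.e.
# everything after www.facebook.com/ up to the end of the line.
fb_link = re.findall(r'(?:www.facebook.com/)(.*)', gig_content)
print(fb_link)
```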

Hope that helps.
