Scrapy fails to crawl recursively when two rules are in place

Published 2024-06-17 10:58:03


I've tried to upload what I can see in the console. I wrote a script with scrapy that should crawl a website recursively, but for some reason it can't. I tested the XPaths in Sublime and they work flawlessly, so at this point I can't figure out what I'm doing wrong.

The items.py file includes:

import scrapy
class CraigpItem(scrapy.Item):
    Name = scrapy.Field()
    Grading = scrapy.Field()
    Address = scrapy.Field()
    Phone = scrapy.Field()
    Website = scrapy.Field()

The spider, craigsp.py, includes:

from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor

class CraigspSpider(CrawlSpider):
    name = "craigsp"
    allowed_domains = ["craigperler.com"]
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//area')),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'), callback='parse_items'),
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="page__content"]')
        for titles in page:
            AA = titles.xpath('.//h1[@class="page__heading"]/text()').extract()
            BB = titles.xpath('.//p[@class="appraiser__grading"]/strong/text()').extract()
            CC = titles.xpath('.//p[@class="appraiser__hours"]/text()').extract()
            DD = titles.xpath('.//p[@class="appraiser__phone"]/text()').extract()
            EE = titles.xpath('.//p[@class="appraiser__website"]/a[@class="appraiser__link"]/@href').extract()
            yield {'Name': AA, 'Grading': BB, 'Address': CC, 'Phone': DD, 'Website': EE}

The command I run is:

scrapy crawl craigsp -o items.csv

Hopefully someone can point me in the right direction.
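As a side note, a quick way to see which links each rule actually picks up (a debugging sketch, assuming the start page is reachable) is Scrapy's interactive shell:

scrapy shell "https://www.americangemsociety.org/en/find-a-jeweler"
>>> from scrapy.linkextractors import LinkExtractor
>>> # links matched by the first rule (image-map areas)
>>> LinkExtractor(restrict_xpaths='//area').extract_links(response)
>>> # links matched by the second rule (jeweler detail links)
>>> LinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]').extract_links(response)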


1 Answer

#1 · Posted 2024-06-17 10:58:03

Filtered offsite request

This message means that a URL scrapy tried to queue did not pass the allowed_domains filter, which is applied by the offsite spider middleware.
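Roughly, the filter boils down to a domain/subdomain check. Here is a simplified sketch of that logic, not Scrapy's actual implementation:

from urllib.parse import urlparse

allowed_domains = ["craigperler.com"]

def is_offsite(url):
    # a URL passes if its host equals an allowed domain or is a subdomain of one
    host = urlparse(url).netloc
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_offsite("https://www.americangemsociety.org/en/find-a-jeweler"))  # True: gets filtered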

You have:

allowed_domains = ["craigperler.com"]

Your spider is trying to crawl http://www.americangemsociety.org. You either need to add that domain to the allowed_domains list or remove the setting entirely.
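For example, keeping the setting but matching the site actually being crawled (a minimal sketch; in Scrapy a bare domain also matches its subdomains, so www.americangemsociety.org passes too):

allowed_domains = ["americangemsociety.org"]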
