I wrote a script with Scrapy to crawl a website recursively, but for some reason it doesn't work. I tested the XPath expressions in Sublime and they run perfectly fine, so at this point I can't figure out what I'm doing wrong.
"items.py" includes:
import scrapy
class CraigpItem(scrapy.Item):
    Name = scrapy.Field()
    Grading = scrapy.Field()
    Address = scrapy.Field()
    Phone = scrapy.Field()
    Website = scrapy.Field()
The spider, "craigsp.py", includes:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class CraigspSpider(CrawlSpider):
    name = "craigsp"
    allowed_domains = ["craigperler.com"]
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//area')),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'), callback='parse_items'),
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="page__content"]')
        for titles in page:
            AA = titles.xpath('.//h1[@class="page__heading"]/text()').extract()
            BB = titles.xpath('.//p[@class="appraiser__grading"]/strong/text()').extract()
            CC = titles.xpath('.//p[@class="appraiser__hours"]/text()').extract()
            DD = titles.xpath('.//p[@class="appraiser__phone"]/text()').extract()
            EE = titles.xpath('.//p[@class="appraiser__website"]/a[@class="appraiser__link"]/@href').extract()
            yield {'Name': AA, 'Grading': BB, 'Address': CC, 'Phone': DD, 'Website': EE}
The command I run is:
scrapy crawl craigsp -o items.csv
Hoping someone can point me in the right direction.
This error means that a URL queued by Scrapy did not pass the allowed_domains setting. You have:

    allowed_domains = ["craigperler.com"]

Your spider is trying to crawl http://ww.americangemsociety.org. You either need to add that domain to the allowed_domains list, or remove the setting entirely.
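To see why the requests get dropped, the helper below is a rough, stdlib-only approximation of the check that Scrapy's offsite filtering performs against allowed_domains (it is not Scrapy's actual implementation; the function name is made up for illustration). A host passes if it equals an allowed domain or is a subdomain of one, which is why "americangemsociety.org" would also admit "www.americangemsociety.org":

    from urllib.parse import urlparse

    def is_allowed(url, allowed_domains):
        """Rough sketch of the allowed_domains check: a URL passes if its
        host equals an allowed domain or is a subdomain of one."""
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in allowed_domains)

    # With the question's setting, the site being crawled is filtered out:
    print(is_allowed("https://www.americangemsociety.org/en/find-a-jeweler",
                     ["craigperler.com"]))            # False: request dropped
    # Widening allowed_domains to the crawled site lets requests through:
    print(is_allowed("https://www.americangemsociety.org/en/find-a-jeweler",
                     ["americangemsociety.org"]))     # True

So the one-line fix in the spider would be allowed_domains = ["americangemsociety.org"] (or deleting the attribute if the crawl should not be restricted at all).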