I need to generate absolute URLs from these relative URLs. I tried using process_links, but to no avail. Any suggestions?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FfySpider(CrawlSpider):
    name = 'FFy'
    allowed_domains = ['cartoon3rbi.net']
    start_urls = ['https://www.cartoon3rbi.net/cats-pages-1.html/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="cartoon_cat_name"]'), process_links='make_absolute_path',
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath('//div[@class="cartoon_eps_name"]/a/text()[2]').extract(),
        }

    def make_absolute_path(self, links):
        for link in links:
            url = 'https://www.cartoon3rbi.net/' + link
            return url
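From the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#crawling-rules
I think the function specified by process_links is called with a list of links and should return a list of links (or a generator). Your make_absolute_path returns a single string instead, and each link it receives is a Link object rather than a string, so the concatenation will not work as written.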
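A minimal sketch of what that could look like, assuming each link exposes a writable url attribute as Scrapy's Link objects do. FakeLink is a hypothetical stand-in so the snippet runs without Scrapy, and the relative paths in the usage lines are made up for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.cartoon3rbi.net/'


class FakeLink:
    """Hypothetical stand-in for scrapy.link.Link; only the url attribute is modelled."""

    def __init__(self, url):
        self.url = url


def make_absolute_path(links):
    """Rewrite each link's url in place and return the whole list of links."""
    for link in links:
        # urljoin resolves relative paths against BASE_URL and leaves
        # URLs that are already absolute unchanged.
        link.url = urljoin(BASE_URL, link.url)
    return links


# Illustrative input: one relative path and one already-absolute URL.
links = make_absolute_path([
    FakeLink('cat-5.html'),
    FakeLink('https://www.cartoon3rbi.net/cat-6.html'),
])
print([link.url for link in links])
```

Inside the spider this would be a method (with self as the first parameter) so that process_links='make_absolute_path' can find it by name. As far as I know, LinkExtractor already resolves each href against the page URL, so it may be worth checking whether the links are really relative by the time they reach this function.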