Generating absolute paths from relative URLs in a CrawlSpider

Posted on 2024-04-26 12:27:24


I need to generate absolute URLs from these relative URLs. I tried using process_links, but to no avail. Any suggestions?

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FfySpider(CrawlSpider):
    name = 'FFy'
    allowed_domains = ['cartoon3rbi.net']
    start_urls = ['https://www.cartoon3rbi.net/cats-pages-1.html/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="cartoon_cat_name"]'), process_links='make_absolute_path',
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {

            'name': response.xpath('//div[@class="cartoon_eps_name"]/a/text()[2]').extract(),
        }

    def make_absolute_path(self, links):
        for link in links:
            url = 'https://www.cartoon3rbi.net/' + link
            return url

1 Answer

User · #1 · Posted on 2024-04-26 12:27:24

From the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#crawling-rules

process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

I think the function named by process_links is called with the list of extracted links and should return a list (or generator) of links. Note that each item in that list is a scrapy.link.Link object rather than a plain URL string, so the absolute URL has to be built from its url attribute:

    def make_absolute_path(self, links):
        for link in links:
            # each item is a scrapy.link.Link object, not a string,
            # so build the absolute URL from its url attribute
            link.url = 'https://www.cartoon3rbi.net/' + link.url.lstrip('/')
            yield link
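
If you also want the absolute URL of each episode link in the scraped item itself, here is a minimal sketch of the callback (assuming the <a> elements under cartoon_eps_name carry an href attribute, which the question's snippet does not show) using response.urljoin, which resolves a possibly relative href against the URL of the current response:

    def parse_item(self, response):
        for a in response.xpath('//div[@class="cartoon_eps_name"]/a'):
            yield {
                'name': a.xpath('text()[2]').extract_first(),
                # response.urljoin resolves the href against the current page URL,
                # so it works whether the link is relative or already absolute
                'url': response.urljoin(a.xpath('@href').extract_first()),
            }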
