Scrapy cannot parse links

Posted 2024-05-14 19:06:02


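My spider does not collect the links correctly: it keeps requesting only fragments of the links on the page. How can I get my parser to work?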

import scrapy


class GlobaldriveruSpider(scrapy.Spider):
    name = 'globaldriveru'
    allowed_domains = ['globaldrive.ru']
    start_urls = ['https://globaldrive.ru/moskva/motory/?items_per_page=500']

    def parse(self, response):
        links = response.xpath('//div[@class="ty-grid-list__item-name"]/a/@href').get()
        for link in links:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_products, dont_filter=True)
            #yield scrapy.Request(link, callback=self.parse_products, dont_filter=True)

    def parse_products(self, response):
     #       for parse_products in response.xpath('//div[contains(@class, "container-fluid  products_block_page")]'):
        item = dict()
        item['title'] = response.xpath('//h1[@class="ty-product-block-title"]/text()').extract_first()
        yield item

Here is part of the output log:

[]
2019-04-29 16:21:12 [scrapy.core.engine] INFO: Spider opened
2019-04-29 16:21:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-29 16:21:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-04-29 16:21:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://globaldrive.ru/robots.txt> (referer: None)
2019-04-29 16:21:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://globaldrive.ru/moskva/motory/?items_per_page=500> (referer: None)
2019-04-29 16:21:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://globaldrive.ru/h/> from <GET https://globaldrive.ru/h>
2019-04-29 16:21:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://globaldrive.ru/-/> from <GET https://globaldrive.ru/->
2019-04-29 16:21:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://globaldrive.ru/%d0%b9/> from <GET https://globaldrive.ru/%D0%B9>
2019-04-29 16:21:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://globaldrive.ru/%d1%80/> from <GET https://globaldrive.ru/%D1%80>

1 answer
User
#1 · Posted 2024-05-14 19:06:02

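In the parse function, replace .get() with .extract() (or its synonym .getall()). .get() returns only the first match as a single string, so the for loop iterates over that string one character at a time. That is why the log shows requests to URLs such as https://globaldrive.ru/h and https://globaldrive.ru/-. What you want is to extract all the links as a list: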

def parse(self, response):
    links = response.xpath('//div[@class="ty-grid-list__item-name"]/a/@href').extract()  # <- here
    for link in links:
        yield scrapy.Request(response.urljoin(link), self.parse_products)
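
To see why the original loop produced one-character URLs, here is a minimal sketch that uses only the standard library. It does not call Scrapy: first_match and all_matches are hypothetical stand-ins that imitate what .get() and .extract() return, and the href values and the base URL are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL; response.urljoin() resolves against the page URL in a similar way.
BASE = "https://globaldrive.ru/"

# Hypothetical href values standing in for what the XPath would match.
hrefs = ["hypothetical-motor-1/", "hypothetical-motor-2/"]


def first_match(values):
    """Imitates .get(): the first match as a single string, or None."""
    return values[0] if values else None


def all_matches(values):
    """Imitates .extract() / .getall(): every match, as a list of strings."""
    return list(values)


# Looping over a string yields one character per iteration,
# so every character becomes its own (wrong) URL.
broken = [urljoin(BASE, part) for part in first_match(hrefs)]

# Looping over a list yields one complete href per iteration.
fixed = [urljoin(BASE, part) for part in all_matches(hrefs)]

print(broken[:3])
print(fixed)
```

With these made-up values, the first "broken" URL is https://globaldrive.ru/h, the same shape as the redirects in the log above, while the "fixed" list holds one full URL per link.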
