Python Scrapy: extracting names with a CSS selector does not work

Posted 2024-04-18 13:59:26


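I am trying to extract college names from http://www.bschool.careers360.com/search/all/bangalore using a CSS selector, but no data is extracted. I have already set ROBOTSTXT_OBEY=False. My code after that change is below, but the result is still the same.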

import scrapy

class BloreSpider(scrapy.Spider):
    name = 'blore'
    start_urls = ['http://www.engineering.careers360.com/search/college/bangalore']

    def parse(self, response):
        for quote in response.css('div.title'):
            yield {
                'author': quote.xpath('.//a/text()').extract_first(),
            }

        next_page = response.css('li.pager-next a::attr("href")').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

The log is:

[log output not preserved in the source]

1 Answer
User
#1 · Posted 2024-04-18 13:59:26

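The XPath needs to be relative to your quote node. In other words, you need to add a . before the //. Without the leading dot, //a is evaluated against the whole document, so every iteration returns the first link on the page instead of the link inside the current div.title.

Try this: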

def parse(self, response):
    for quote in response.css('div.title'):
        yield {
            #'author': quote.xpath('//a/text()').extract_first(),
            #                       ^
            'author': quote.xpath('.//a/text()').extract_first(),
        }

    next_page = response.css('li.pager-next a::attr("href")').extract_first()
    # if next_page is not None:
    if next_page:  # you can also just do this
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

Edit: Looking at the log you provided, it seems you are getting a 404 when Scrapy tries to retrieve robots.txt. Try setting ROBOTSTXT_OBEY = False in settings.py.
