The New York Times: today's headlines

import scrapy

class NewYorkSpider(scrapy.Spider):
    name = "times"
    start_urls = [
        "https://www.nytimes.com/column/learning-word-of-the-day"
    ]

    # entry point for the spider
    def parse(self,response):
        for href in response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]'):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        word = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "story-subheading", " " ))]//strong').extract()[0]

2 answers

User

#1 · edited 2024-05-16 19:56:51

You are passing an XPath expression to the .css method, which expects a CSS selector expression.
Just replace .css with .xpath:

response.css('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]')
# to
response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]')

As for the second error: the extracted URL is not an absolute URL, e.g. /some/sub/page.html. To convert it to an absolute URL, you can use the response.urljoin() function.
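In the spider this would be a call such as response.urljoin(url). As far as I know, that method resolves the link against the response's URL in the same way as the standard library's urljoin, so the behaviour can be sketched without Scrapy. The relative path below is the placeholder from the text above, not a real NYT link.

```python
from urllib.parse import urljoin

# Base URL of the page the link was found on (the spider's start URL).
page_url = "https://www.nytimes.com/column/learning-word-of-the-day"

# Placeholder relative link, as in the example above.
relative_href = "/some/sub/page.html"

# Resolve the relative link against the page URL.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "https://example.com/x")
print(already_absolute)
```

Inside parse, the equivalent would be roughly scrapy.Request(response.urljoin(url), callback=self.parse_item).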

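As for your third error: your XPath is the problem here. It looks like you used some XPath generator, and those rarely produce anything of value. All you are looking for here is the <a> nodes with the class story-link.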

For your word XPath, you can simply take the text under the <strong> node, which sits inside the <h4>:

word = response.xpath("//h4/strong/text()").extract_first()

User

#2 · edited 2024-05-16 19:56:51

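This code should work. To get any other information you need from each word's page, just use the appropriate selectors with XPath or CSS expressions.

For more information on selectors, I recommend this site and, of course, {a2}.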

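import scrapy


class NewYorkSpider(scrapy.Spider):
    name = "times"
    start_urls = ["https://www.nytimes.com/column/learning-word-of-the-day"]

    # entry point for the spider
    def parse(self, response):
        # collect the href of every <a class="story-link"> on the listing page
        for href in response.css('a[class="story-link"]::attr(href)'):
            # urljoin turns a relative link into an absolute one before requesting it
            yield scrapy.Request(response.urljoin(href.extract()), callback=self.parse_item)

    def parse_item(self, response):
        heading = response.css('h4[class="story-subheading story-content"] strong::text').extract_first()
        # emit the extracted text as an item so it reaches the output
        yield {"heading": heading}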
