XPath and Scrapy: I get everything a hundred times over

Posted 2024-06-06 05:15:56

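I'm using Scrapy 1.2 with XPath (and, of course, Python 3.4) to read the Hot 100 chart on billboard.com. When I use the second option in my code, every song gets all 100 titles. I know that's because of the double slash (//), but I can't get the first option to work. How can I make sure each song gets only its own title?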

import scrapy


class MusicalSpider(scrapy.Spider):
    name = "musicalspider"
    allowed_domains = ["billboard.com"]
    start_urls = ['http://www.billboard.com/charts/hot-100/']

    def parse(self, response):
        songs = response.xpath('//div[@class="chart-data js-chart-data"]/div[@class="container"]/article')

        for song in songs:
            item = MusicItem()
            # first option:
            item['title'] = song.xpath('div[@class="chart-row__primary"]/div[@class="chart-row__main-display"]/div[@class="chart-row__container"]/div[@class="chart-row__title"]/h2[@class="chart-row__song"]').extract()
            # second option:
            item['title'] = song.xpath('//h2[@class="chart-row__song"]').extract()

            yield item

1 answer
User
#1 · Posted 2024-06-06 05:15:56

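This is a very common problem. Remember to start the XPath expressions inside the loop with a dot; that makes them relative to the current node instead of the whole document: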

for song in songs:
    item = MusicItem()
    # first option:
    item['title'] = song.xpath('.//div[@class="chart-row__primary"]/div[@class="chart-row__main-display"]/div[@class="chart-row__container"]/div[@class="chart-row__title"]/h2[@class="chart-row__song"]').extract()
    # second option:
    item['title'] = song.xpath('.//h2[@class="chart-row__song"]').extract()

    yield item

Here is the spider that worked for me:

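import scrapy


# MusicItem is not shown in the post; a minimal definition is assumed here.
class MusicItem(scrapy.Item):
    title = scrapy.Field()


class MusicalSpider(scrapy.Spider):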
    name = "musicalspider"
    allowed_domains = ["billboard.com"]
    start_urls = ['http://www.billboard.com/charts/hot-100/']

    def parse(self, response):
        songs = response.xpath('//div[@class="chart-data js-chart-data"]/div[@class="container"]/article')

        for song in songs:
            item = MusicItem()
            item['title'] = song.xpath('.//h2[@class="chart-row__song"]/text()').extract_first()
            yield item

It yields the following items:

{'title': u'Black Beatles'}
{'title': u'Closer'}
...
{'title': u'Hold Up'}
{'title': u'Gangsta'}
