Unable to get product titles from Amazon

Published 2024-06-09 08:58:40


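I am using Scrapy to get the price and name of products on this Amazon website. Extracting the price works without problems, but I am having trouble with the title. The difference is that the title element carries the attribute aria-hidden="true". Here is an example: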

<div class="p13n-sc-truncated" aria-hidden="true" data-rows="2" title="Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine With Handle, 26 Pounds in 24 Hours, 9 Ice Cubes Ready in 7 minutes, With Ice Scoop and Basket">Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…</div>

Here is the CSS selector command:

title = response.css('.p13n-sc-truncated').css('::text').extract()

What should the CSS selector be to extract the text? Thank you.


3 Answers

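If you look at the HTML source (Ctrl+U), you will see that the product title also has another class, p13n-sc-line-clamp-2, and selecting on it works well. So your CSS selector can look like this: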

response.css('.p13n-sc-line-clamp-2::text').get().strip()

Here is a simple working example:

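import scrapy


class AmaSpider(scrapy.Spider):
    # A plain Spider is used here: CrawlSpider reserves parse() for its own rule handling.
    name = 'amatitle'
    start_urls = ['https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/']

    def parse(self, response):
        # get(default='') avoids an AttributeError when the selector matches nothing.
        yield {'title': response.css('.p13n-sc-line-clamp-2::text').get(default='').strip()}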

To extract all titles and strip their leading and trailing whitespace, change the parse function to the following:

    def parse(self, response):
        titles = response.css('.p13n-sc-line-clamp-2::text').getall()
        titles_strip = [x.strip() for x in titles]
        yield{'titles': titles_strip}
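
`.get()` returns `None` when a selector matches nothing, and `.getall()` can return whitespace-only strings, both of which are plausible if Amazon serves a different page than expected. A minimal sketch of a helper that normalizes whatever the selector returns; the name `clean_titles` and the sample strings are hypothetical and not part of Scrapy:

```python
from typing import Iterable, List, Optional


def clean_titles(raw: Iterable[Optional[str]]) -> List[str]:
    """Strip whitespace and drop missing or empty values."""
    cleaned = []
    for value in raw:
        if value is None:
            continue  # selector matched nothing for this entry
        text = value.strip()
        if text:  # skip whitespace-only text nodes
            cleaned.append(text)
    return cleaned


# Made-up sample values, only to show the behavior.
print(clean_titles(["  Igloo Ice Maker ", None, "\n", "Frigidaire EFIC189"]))
# → ['Igloo Ice Maker', 'Frigidaire EFIC189']
```

Inside a spider callback this could be called as `clean_titles(response.css('.p13n-sc-line-clamp-2::text').getall())`.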

Your code is fine:

>>> from parsel import Selector
>>> selector = Selector(text='<div class="p13n-sc-truncated" aria-hidden="true" data-rows="2" title="Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine With Handle, 26 Pounds in 24 Hours, 9 Ice Cubes Ready in 7 minutes, With Ice Scoop and Basket">Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…</div>')
>>> selector.css('.p13n-sc-truncated').css('::text').extract()
['Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…']

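My guess is that the response does not contain the HTML you expect. If this is Amazon, that is very likely; they have quite a few anti-bot measures.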
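
One way to test that guess is to inspect the raw HTML that actually came back before blaming the selector. The sketch below is only an illustration: `describe_response` is a hypothetical helper, and searching for the word "captcha" is an assumption about what a block page might contain, not something Amazon documents.

```python
def describe_response(html: str, class_name: str) -> dict:
    """Summarize whether the downloaded HTML looks like the expected page."""
    lowered = html.lower()
    return {
        "length": len(html),
        # Does the class we want to select appear anywhere in the markup?
        "has_class": class_name in html,
        # "captcha" is only a guess at what a block page might mention.
        "mentions_captcha": "captcha" in lowered,
    }


# Shortened sample markup standing in for a real response body.
sample = '<div class="p13n-sc-truncated" title="Full title">Short…</div>'
print(describe_response(sample, "p13n-sc-truncated"))
```

In a Scrapy callback the same check could be run on `response.text`. In `scrapy shell`, `view(response)` opens the downloaded page in a browser, which shows directly what the spider received.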

You can solve this with XPath. Go to xpather, paste the HTML there, and work out the XPath pattern.

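import scrapy


class SSDSpider(scrapy.Spider):
    name = "SSD_spider"
    start_urls = ['https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0']
    # Per-spider settings go in custom_settings; a bare DOWNLOAD_DELAY
    # class attribute is not picked up as a setting.
    custom_settings = {'DOWNLOAD_DELAY': 10}

    def parse(self, response):
        yield {
            # This returns the matching <div> elements as HTML strings;
            # append /text() or /@title to the expression to get only the text.
            'title': response.xpath('//div[@class="p13n-sc-truncated"][1]').extract(),
        }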


Try using Beautiful Soup:

pip install beautifulsoup4
pip install lxml
# Alternatively, on Debian/Ubuntu (package name for Python 3):
apt-get install python3-lxml

Beautiful Soup also needs a parser; the example below uses lxml, which is installed separately as shown above.

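import bs4 as bs
import urllib.request

# Placeholder URL from the original answer; replace it with the real page.
source = urllib.request.urlopen('https://your_amazon_link/product/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for item in soup.select("ol#zg-ordered-list > li"):
    title_tag = item.select_one(".p13n-sc-truncated")
    if title_tag is None:  # skip list items without a title element
        continue
    print(title_tag.get_text(strip=True))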
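
Whichever library is used, the visible text of `.p13n-sc-truncated` in the question's sample ends in an ellipsis, while the element's `title` attribute holds the untruncated name. The following sketch reads that attribute using only the standard library so it can be run without Scrapy or Beautiful Soup. `TitleAttrCollector` is a hypothetical name, and the sample markup is a shortened copy of the element from the question.

```python
from html.parser import HTMLParser


class TitleAttrCollector(HTMLParser):
    """Collect the title attribute of every tag carrying a given class."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.class_name in classes and attr_map.get("title"):
            self.titles.append(attr_map["title"])


# Shortened version of the <div> shown in the question.
html = (
    '<div class="p13n-sc-truncated" aria-hidden="true" data-rows="2" '
    'title="Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric '
    'Countertop Ice Maker Machine With Handle">Igloo ICEB26HNAQ…</div>'
)
collector = TitleAttrCollector("p13n-sc-truncated")
collector.feed(html)
print(collector.titles)
```

With Scrapy selectors the equivalent would be `response.css('.p13n-sc-truncated::attr(title)').getall()`, and with Beautiful Soup `title_tag.get("title")`, assuming the served page still contains that element.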
