Using Scrapy and XPath to select the text content of a specific <div> that contains another <div>

Question

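Edit: Solved! For anyone who runs into this while learning: the answer is below, explained in detail by Paul.

This is my first question here. I have searched for a long time (two days so far) without finding a solution. I want to scrape a specific retail website for product names and prices.

I currently have a spider that works on one retail website, but on another retail website it only partly works. The product names come out correctly, but the prices are not in the right format.

First, here is my current spider code: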

import scrapy

from projectname.items import projectItem

class spider_whatever(scrapy.Spider):
    name = "whatever"
    allowed_domains = ["domain.com"]  # Scrapy reads "allowed_domains" (plural)
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        requests = sel.xpath('//div[@class="container"]')
        products = requests.xpath('.//*[@class="productname"]/text()').extract()
        prices = requests.xpath('.//*[@class="price"]').extract() #Issue lies here.

        itemlist = []
        # Pair each product name with the price found at the same position.
        for product, price in zip(products, prices):
            item = projectItem()
            item['product'] = product.strip().upper()
            item['price'] = price.strip()
            itemlist.append(item)
        return itemlist

Now, the target HTML for the price is:

<div id="listPrice1" class="price">
                        $622                        <div class="cents">.00</div>
                    </div>

As you can see, the structure is not only messy, but the div I want to reference also has another div inside it. Now, when I try this:

price = requests.xpath('.//*[@class="price"]/text()').extract()

the output is:

product,price
some_product1, $100
some_product2, 
some_product3, $200
some_product4,

whereas the output I was hoping for is:

product,price
some_product1, $100
some_product2, $200
some_product3, $300
some_product4, $400

I think the problem is that it also extracts the div with class "cents" and assigns it to the next product, which pushes the following values down by one.

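When I tried scraping the data through a Google Docs spreadsheet instead, the product ended up in one column and the price was split across two columns: the dollar amount in the first and the .00 cents in the second, like this: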

product,price,cents
some_product1, $100, .00
some_product2, $200, .00
some_product3, $300, .00
some_product4, $400, .00

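So my question is: how do I separate a div that is nested inside another div? Is there a specific way to exclude it in the XPath, or can I filter it out while parsing the data? If it can be filtered out, how would I do that?

Any help is greatly appreciated. Please bear in mind that I am relatively new to Python and am doing my best to learn.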

