Selecting the text content of a specific <div> that contains another <div>, using Scrapy and XPath
EDIT: Solved! For anyone who runs into this while learning: the answer is below, explained in detail by Paul.
This is my first question here, and I have been searching for a while (two days so far) without finding a solution. I want to scrape a particular retail website for product names and prices.
Right now I have a spider that works on one retail site, but on another site it only sort of works: I get the product names correctly, but the prices come out in the wrong format.
First, here is my current spider:
import scrapy
from projectname.items import projectItem

class spider_whatever(scrapy.Spider):
    name = "whatever"
    allowed_domain = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        sel = scrapy.Selector(response)
        # one selector per product container
        requests = sel.xpath('//div[@class="container"]')
        product = requests.xpath('.//*[@class="productname"]/text()').extract()
        price = requests.xpath('.//*[@class="price"]').extract()  # Issue lies here.
        itemlist = []
        for product, price in zip(product, price):
            item = projectItem()
            item['product'] = product.strip().upper()
            item['price'] = price.strip()
            itemlist.append(item)
        return itemlist
Now, the target HTML for the price is:
<div id="listPrice1" class="price">
$622 <div class="cents">.00</div>
</div>
As you can see, not only is the structure messy, there is another div inside the div I want to reference. Now, when I try this:
price = requests.xpath('.//*[@class="price"]/text()').extract()
it outputs:
product,price
some_product1, $100
some_product2,
some_product3, $200
some_product4,
whereas the output I was hoping for is:
product,price
some_product1, $100
some_product2, $200
some_product3, $300
some_product4, $400
I think the problem is that it also extracts the div with class "cents" and assigns it to the next product, pushing every following value down by one.
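You can reproduce the shift with a quick standalone check (a minimal sketch using made-up markup shaped like the snippet above, not the real site):
import scrapy

# Two fake products, each with the nested "cents" div.
html = """
<div class="price">
  $100 <div class="cents">.00</div>
</div>
<div class="price">
  $200 <div class="cents">.00</div>
</div>
"""
sel = scrapy.Selector(text=html)

# Each price <div> contributes TWO text nodes: the "$100 " part and the
# whitespace that follows the inner <div class="cents">. So four strings come
# back for only two prices, e.g. ['\n  $100 ', '\n', '\n  $200 ', '\n'],
# and zip()-ing that against two product names pairs the second product with
# a whitespace-only string -- exactly the shifted output shown above.
print(sel.xpath('//*[@class="price"]/text()').extract())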
When I tried scraping the data through a Google Docs spreadsheet, the products ended up in one column while the prices were split across two columns; the first held the dollar amount and the second the .00 cents, like this:
product,price,cents
some_product1, $100, .00
some_product2, $200, .00
some_product3, $300, .00
some_product4, $400, .00
So my question is: how do I separate a div within a div? Is there a specific way to exclude it in XPath, or can I filter it out when parsing the data? If it can be filtered out, how would I do that?
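(For what it's worth, the "filter it out while parsing" route could be as simple as dropping the strings that are empty once stripped; a rough sketch against the requests selector above:)
raw_prices = requests.xpath('.//*[@class="price"]/text()').extract()
# keep only the entries that still contain something after stripping whitespace
prices = [p.strip() for p in raw_prices if p.strip()]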
Any help is greatly appreciated. Please bear with me, I'm relatively new to Python and doing my best to learn.
1 Answer
Let's look at a few different XPath patterns:
>>> import scrapy
>>> selector = scrapy.Selector(text="""<div id="listPrice1" class="price">
... $622 <div class="cents">.00</div>
... </div>""")
# /text() will select all text nodes under the context node,
# here any element with class "price"
# there are 2 of them
>>> selector.xpath('.//*[@class="price"]/text()').extract()
[u'\n $622 ', u'\n ']
# if you wrap the context node inside the "string()" function,
# you'll get the string representation of the node,
# basically a concatenation of text elements
>>> selector.xpath('string(.//*[@class="price"])').extract()
[u'\n $622 .00\n ']
# using "normalize-space()" instead of "string()",
# it will replace multiple space with 1 space character
>>> selector.xpath('normalize-space(.//*[@class="price"])').extract()
[u'$622 .00']
# you could also ask for the 1st text node under the element with class "price"
>>> selector.xpath('.//*[@class="price"]/text()[1]').extract()
[u'\n $622 ']
# space-normalized version of that may do what you want
>>> selector.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract()
[u'$622']
>>>
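If you ever need the cents as well, XPath's concat() can stitch the dollar part together with the inner "cents" div (an untested sketch against the same selector; for the snippet above it should give something like):
>>> selector.xpath('concat(normalize-space(.//*[@class="price"]/text()[1]), '
...                'normalize-space(.//*[@class="price"]//*[@class="cents"]))').extract()
[u'$622.00']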
So, in the end, you probably want this pattern:
def parse(self, response):
    sel = scrapy.Selector(response)
    requests = sel.xpath('//div[@class="container"]')
    itemlist = []
    for r in requests:
        item = projectItem()
        item['product'] = r.xpath('normalize-space(.//*[@class="productname"])').extract()
        item['price'] = r.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract()
        itemlist.append(item)
    return itemlist
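One small refinement to consider: .extract() always returns a list, so the item fields above end up holding one-element lists. If you are on Scrapy 1.0 or later, .extract_first() gives you a plain string instead (same loop, same assumed class names):
for r in requests:
    item = projectItem()
    # extract_first() returns the first match as a string, or None if nothing matched
    item['product'] = r.xpath('normalize-space(.//*[@class="productname"])').extract_first()
    item['price'] = r.xpath('normalize-space(.//*[@class="price"]/text()[1])').extract_first()
    itemlist.append(item)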