How to get past the <hr> tag to retrieve the price from inside an HTML block

Posted 2024-04-26 22:53:28


I'm using Python and BeautifulSoup.

I have a page that contains several HTML blocks like the following:

<div class="col-sm-6 col-md-3"> <div class="thumbnail box-hover thumb-article-product"> <div class="ProductPicWrapper"> <div class="test"> <a href="product_info.php?products_id=5055856419716"><img width="120" height="120" src="https://cdn.smartoys.be/catalog/images/thumbs/120_120/products/x5055856419716.JPG.pagespeed.ic.uX1UW7-Gxw.webp" title="The Elder Scrolls Online : Summerset" alt="The Elder Scrolls Online : Summerset" class="img-responsive" data-pagespeed-url-hash="3637984103" onload="pagespeed.CriticalImages.checkImageForCriticality(this);"/></a> </div> </div> <div class="caption"> <p class="text-center nameart"><a href="product_info.php?products_id=5055856419716">The Elder Scrolls Online : Summerset</a></p> <p class="group inner list-group-item-heading nameart-ean text-center">5055856419716</br>Playstation 4</p> <hr> <p class="text-center article-price article-price-used ">Dès <span itemprop="price">10<span class="product-price-sm">.00&euro;</span></span></p> <p class="text-center"> <span class="label"></span> </p> <div class="text-center"> <div class="btn-group"> <a href="product_info.php?products_id=5055856419716" class="btn btn-danger" role="button">Voir le produit</a> </div> </div> </div> </div></div>

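I want to retrieve the price. I managed to do this on a copy of the page saved locally from Chrome, but when I fetch the page directly online, the HTML I get is quite different.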

On the downloaded page, this was all I needed to get the price (loop removed for simplicity):

productblocks = soup.find_all("div",{"class": "col-sm-6 col-md-3"})
gameprice = productblocks[i].find("p", {"class": "text-center article-price article-price-used "}).text.encode('utf-8').strip()[:-3].replace('Dès ','')

However, when I do the same on the online page, the result of the following code does not include the price part:

productblocks = soup.find_all("div",{"class": "col-sm-6 col-md-3"})

I manage to get the name, the code, and so on, but the price part seems to be missing.

print productblocks[0]

returns:

<div class="col-sm-6 col-md-3"> <div class="thumbnail box-hover thumb-article-product"> <div class="ProductPicWrapper"> <div class="test"> <a href="product_info.php?products_id=5055856419716"><img alt="The Elder Scrolls Online : Summerset" class="img-responsive" height="120" src="https://cdn.smartoys.be/catalog/images/thumbs/120_120/products/x5055856419716.JPG.pagespeed.ic.CdYmLZol8V.jpg" title="The Elder Scrolls Online : Summerset" width="120"/></a> </div> </div> <div class="caption"> <p class="text-center nameart"><a href="product_info.php?products_id=5055856419716">The Elder Scrolls Online : Summerset</a></p><p class="group inner list-group-item-heading nameart-ean text-center">5055856419716</p></div></div></div>

The price part is clearly missing. What am I doing wrong?

Thanks for your help.


2 Answers

Beautiful Soup apparently fails to parse past the hr tag in this HTML. You can try the following to get the price value.

Demo:

from bs4 import BeautifulSoup
s = """<div class="col-sm-6 col-md-3"> <div class="thumbnail box-hover thumb-article-product"> <div class="ProductPicWrapper"> <div class="test"> <a href="product_info.php?products_id=5055856419716"><img width="120" height="120" src="https://cdn.smartoys.be/catalog/images/thumbs/120_120/products/x5055856419716.JPG.pagespeed.ic.uX1UW7-Gxw.webp" title="The Elder Scrolls Online : Summerset" alt="The Elder Scrolls Online : Summerset" class="img-responsive" data-pagespeed-url-hash="3637984103" onload="pagespeed.CriticalImages.checkImageForCriticality(this);"/></a> </div> </div> <div class="caption"> <p class="text-center nameart"><a href="product_info.php?products_id=5055856419716">The Elder Scrolls Online : Summerset</a></p> <p class="group inner list-group-item-heading nameart-ean text-center">5055856419716</br>Playstation 4</p> <hr> <p class="text-center article-price article-price-used ">Dès <span itemprop="price">10<span class="product-price-sm">.00&euro;</span></span></p> <p class="text-center"> <span class="label"></span> </p> <div class="text-center"> <div class="btn-group"> <a href="product_info.php?products_id=5055856419716" class="btn btn-danger" role="button">Voir le produit</a> </div> </div> </div> </div></div>"""

soup = BeautifulSoup(s, "html.parser")
productblocks = soup.find_all("div",{"class": "col-sm-6 col-md-3"})
# Python 3: work on the str directly; [:-1] drops the trailing euro sign
print( productblocks[0].find("p", class_="group inner list-group-item-heading nameart-ean text-center").findNext("p").text.strip()[:-1].replace('Dès ','') )

Output:

10.00
  • Find the p tag just before the hr, then use findNext("p") to reach the price tag.
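
As a complement to the approach above, here is a minimal sketch that looks the price up by its itemprop="price" attribute rather than by position, and runs the same lookup under each parser that happens to be installed. The asker's printed block stops exactly where the source has the stray </br>, which suggests (but does not prove) that the parser in use dropped the rest of the block, so comparing parsers seems a reasonable diagnostic. The names HTML, extract_price and compare_parsers are hypothetical, and the HTML is a trimmed copy of the block from the question.

```python
from decimal import Decimal

from bs4 import BeautifulSoup, FeatureNotFound

# Trimmed copy of the product block from the question (stray </br> and <hr> kept).
HTML = (
    '<div class="col-sm-6 col-md-3"><div class="caption">'
    '<p class="nameart-ean text-center">5055856419716</br>Playstation 4</p>'
    '<hr>'
    '<p class="text-center article-price article-price-used ">Dès '
    '<span itemprop="price">10<span class="product-price-sm">.00&euro;</span></span></p>'
    '</div></div>'
)


def extract_price(html, parser):
    """Return the price as a Decimal, or None if this parser's tree lacks it."""
    soup = BeautifulSoup(html, parser)
    span = soup.find("span", attrs={"itemprop": "price"})
    if span is None:
        return None
    # get_text() joins "10" and ".00€"; drop the currency sign and whitespace.
    return Decimal(span.get_text().replace("€", "").strip())


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Run extract_price under each parser; missing parsers are marked as such."""
    results = {}
    for name in parsers:
        try:
            results[name] = extract_price(html, name)
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            results[name] = "not installed"
    return results


if __name__ == "__main__":
    for name, price in compare_parsers(HTML).items():
        print(name, "->", price)
```

If one parser returns None while another returns a price, the difference lies in how each parser recovers from the malformed markup, not in the find call itself.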

There is a simpler way (where a contains your HTML):

import re
print(re.findall(r'span itemprop="price">(\d+)<span', a))
# ['10']  (integer part of the price only)
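
The pattern above captures only the integer part of the price. A possible variation, sketched here under the assumption that the fractional part always sits at the start of the nested span as in the question's markup, captures both parts and converts the result to a Decimal. PRICE_RE and find_prices are hypothetical names; a follows the variable name used above.

```python
import re
from decimal import Decimal

# Price markup copied from the question; `a` follows the answer's variable name.
a = ('<p class="text-center article-price article-price-used ">Dès '
     '<span itemprop="price">10<span class="product-price-sm">.00&euro;</span></span></p>')

# Group 1: integer part; group 2: optional fractional part inside the nested span.
PRICE_RE = re.compile(
    r'itemprop="price">\s*(\d+)\s*<span[^>]*>\s*([.,]\d+)?'
)


def find_prices(html):
    """Return every price found in `html` as a Decimal (hypothetical helper)."""
    prices = []
    for whole, fraction in PRICE_RE.findall(html):
        # findall yields '' for an unmatched optional group;
        # a decimal comma is normalised to a dot before conversion.
        prices.append(Decimal(whole + fraction.replace(",", ".")))
    return prices


print(find_prices(a))  # [Decimal('10.00')]
```

Like any regex over HTML, this depends on the exact attribute order and quoting in the page, so it is best treated as a quick workaround when the markup is known to be stable.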
