Python 3 BeautifulSoup: get the value of a span tag identified by specific text, where that text can appear at a random position in the HTML tree

Published 2024-05-15 02:06:36


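I tried searching here but honestly could not find an answer. This should be easy to do with Selenium, but since performance is an important factor, I am considering BeautifulSoup instead.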

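Scenario: I need to scrape the prices of different items, which are generated in random order depending on user input. See the markup below: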

<div class="sk-expander-content" style="display: block;">

<ul>
  <li>
    <span>Third Party Liability</span>
    <span>€756.62</span>
  </li>

  <li>
  <span>Fire &amp; Theft</span>
  <span>€15.59</span>
  </li>

</ul>
</div>

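If these options were static and always appeared at the same position in the HTML, grabbing the prices would be easy. But since they can appear anywhere inside the div sk-expander-content, I am not sure how to find them dynamically.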

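The best approach would be to write a method that takes the span text we are looking for and returns the value in euros. The structure of the span tags is always the same: the first span is always the item name and the second span is always the price.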

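The first thing that came to mind was the following code, but I am not sure whether it is robust enough, or whether it even makes sense: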

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

div_i_need = soup.find_all("div", class_="sk-expander-content")[1]

def price_scraper(text_to_find):
    for el in div_i_need.find_all(['ul', 'li', 'span']):
        if el.name == 'span':
            if el[0].text == text_to_find:
                return(el[1].text)

Many thanks for your help.


2 answers
from bs4 import BeautifulSoup
import re

html = """
<div class="sk-expander-content" style="display: block;">

<ul>
  <li>
    <span>Third Party Liability</span>
    <span>€756.62</span>
  </li>

  <li>
  <span>Fire &amp; Theft</span>
  <span>€15.59</span>
  </li>

</ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

target = soup.select("div.sk-expander-content")

for tar in target:
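    # Keep only the spans whose text contains a euro sign.
    # find_all/string= replace the deprecated findAll/text= spellings.
    data = [item.text for item in tar.find_all("span", string=re.compile("€"))]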
    print(data)

Output:

['€756.62', '€15.59']

Note: I used select, which returns a ResultSet, in order to find every matching div.
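
The snippet above collects every price but does not tie a price to its label, which is what the question asks for. A minimal sketch of a lookup by label follows, assuming the two-span structure described in the question always holds. The helper name `price_for` is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<div class="sk-expander-content" style="display: block;">
<ul>
  <li>
    <span>Third Party Liability</span>
    <span>€756.62</span>
  </li>
  <li>
  <span>Fire &amp; Theft</span>
  <span>€15.59</span>
  </li>
</ul>
</div>
"""


def price_for(soup, label):
    """Hypothetical helper: return the price text for `label`, or None."""
    for li in soup.select("div.sk-expander-content li"):
        spans = li.find_all("span")
        # Assumption from the question: first span is the name,
        # second span is the price.
        if len(spans) >= 2 and spans[0].get_text(strip=True) == label:
            return spans[1].get_text(strip=True)
    return None


soup = BeautifulSoup(html, "html.parser")
print(price_for(soup, "Fire & Theft"))           # €15.59
print(price_for(soup, "Third Party Liability"))  # €756.62
```

Because the lookup walks each li and compares the first span, the order of the items inside the div does not matter. Note that the parser decodes `&amp;` to `&`, so the label is passed as "Fire & Theft".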

Using a regular expression

import re
from bs4 import BeautifulSoup

html='''<div class="sk-expander-content" style="display: block;">

<ul>
  <li>
    <span>Third Party Liability</span>
    <span>€756.62</span>
  </li>

  <li>
  <span>Fire &amp; Theft</span>
  <span>€15.59</span>
  </li>

</ul>
</div>
<div class="sk-expander-content" style="display: block;">

<ul>
  <li>
    <span>Fire &amp; Theft</span>
    <span>€756.62</span>
  </li>

  <li>
  <span>Third Party Liability</span> 
  <span>€15.59</span>
  </li>

</ul>
</div>'''

soup = BeautifulSoup(html, "html.parser")

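for item in soup.find_all(class_="sk-expander-content"):
    # Match the price spans; raw string, with the dot escaped so it
    # only matches a literal "."
    for span in item.find_all('span', string=re.compile(r"€\d+\.\d+")):
        # The label is the span immediately before the price span
        print(span.find_previous_sibling('span').text)
        print(span.text)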

Output:

Third Party Liability
€756.62
Fire & Theft
€15.59
Fire & Theft
€756.62
Third Party Liability
€15.59

Update: if you only want the values from the first matching node, use find() instead of find_all()

import re
from bs4 import BeautifulSoup

html='''<div class="sk-expander-content" style="display: block;">

<ul>
  <li>
    <span>Third Party Liability</span>
    <span>€756.62</span>
  </li>

  <li>
  <span>Fire &amp; Theft</span>
  <span>€15.59</span>
  </li>

</ul>
</div>
<div class="sk-expander-content" style="display: block;">

<ul>
  <li>
    <span>Fire &amp; Theft</span>
    <span>€756.62</span>
  </li>

  <li>
  <span>Third Party Liability</span> 
  <span>€15.59</span>
  </li>

</ul>
</div>'''

soup = BeautifulSoup(html, "html.parser")

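# find() returns only the first matching div; find_all() then scans its spans
first_div = soup.find(class_="sk-expander-content")
for span in first_div.find_all('span', string=re.compile(r"€\d+\.\d+")):
    print(span.find_previous_sibling('span').text)
    print(span.text)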
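
The question asks for the value in euros, and the snippets here return it as text such as "€756.62". If a number is needed, one possible conversion is sketched below. It assumes the format is always a euro sign followed by digits with a dot as the decimal separator, as in the sample; thousands separators and comma decimals are not handled. `parse_euro` is a hypothetical helper name.

```python
import re
from decimal import Decimal

# Euro sign, optional whitespace, then digits with an optional decimal part
EURO_PRICE = re.compile(r"€\s*(\d+(?:\.\d+)?)")


def parse_euro(text):
    """Hypothetical helper: turn text like '€756.62' into a Decimal."""
    match = EURO_PRICE.search(text)
    if match is None:
        raise ValueError(f"no euro amount found in {text!r}")
    # Decimal avoids the rounding surprises of float for money values
    return Decimal(match.group(1))


print(parse_euro("€756.62") + parse_euro("€15.59"))  # 772.21
```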
