Scraping HTML with BeautifulSoup4 in Python and distinguishing between identical tags

Posted 2024-04-24 23:32:13


I'm trying to scrape HTML with bs4 in Python. The page repeats the same tags, and those tags hold the data I want. The data I want to collect is in class="tip_date_time", class="tip_wave" and class="tip_train".

So far I have the following in Python:

soup = BeautifulSoup(res.content, 'html.parser')
html = soup.find_all("div", {"class": "forecast_tip"}) 

dateCond = []
for date in html:
    for text in date.find_all("div", {"class": "tip_date_time"}):
        dateCond.append(text.getText())

waveCond = []
for wave in html:
    for text in wave.find_all("span", {"class": "tip_wave"}):
        waveCond.append(text.getText())

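This creates a separate list for each scraped field, which I intend to line up by index, so dateCond[0] corresponds to waveCond[0]. That works fine because every list has the same number of items.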

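However, I've run into a problem with "tip_train", because it can vary from 1 to 3 entries depending on the date. If I use the same code, I may end up with a list whose length differs from the others, which throws off the alignment.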
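To make the misalignment concrete, here is a minimal offline sketch. The SAMPLE markup is a made-up two-block fragment modeled on the HTML in this question, not real site data. It flattens "tip_train" across all blocks the same way the loops above do and then pairs the lists by index.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: block 1 has three tip_train entries, block 2 has one.
SAMPLE = """
<div class="forecast_tip">
  <div class="tip_date_time">6am Mon 21 Sep</div>
  <div class="tip_train">1.3m @ 7.5s ENE</div>
  <div class="tip_train">0.4m @ 13.1s SSE</div>
  <div class="tip_train">0.3m @ 13.8s SSW</div>
</div>
<div class="forecast_tip">
  <div class="tip_date_time">12pm Mon 21 Sep</div>
  <div class="tip_train">1.2m @ 7.4s ENE</div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
blocks = soup.find_all("div", {"class": "forecast_tip"})

# Flatten each field across all blocks, as in the question's loops.
dates = [d.getText() for b in blocks
         for d in b.find_all("div", {"class": "tip_date_time"})]
trains = [t.getText() for b in blocks
          for t in b.find_all("div", {"class": "tip_train"})]

print(len(dates), len(trains))  # the two lists no longer have equal length

# zip stops at the shorter list and pairs purely by position, so the
# second date is matched with a swell train from the first block.
pairs = list(zip(dates, trains))
for pair in pairs:
    print(pair)
```

The second pair combines the 12pm date with a train from the 6am block, which is why the per-block structure has to be preserved while extracting.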

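So I'd like to select only the first 2 instances of "tip_train" within each block that contains a "tip_date_time" div. I can't simply take the first 2 instances scraped overall, because I want the first 2 for each day.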

The HTML looks like this:

    <div class="forecast_tip">
    <div class="tip_date_time">6am Mon 21 Sep</div>

      <div class="tip_surf">
      <span class="tip_wave">2ft ENE</span>
      <span class="tip_wind">7kt NNW</span>
    </div>
    <div class="tip_description">(Waist-Shoulder High)</div>
  
  
      <div class="tip_train">1.3m @ 7.5s ENE (64&deg;)</div>
        <div class="tip_train">0.4m @ 13.1s SSE (167&deg;)</div>
        <div class="tip_train">0.3m @ 13.8s SSW (194&deg;)</div>
  
  <div class="tip_tides">
          <div class="tip_tide">
        <span class="tip_tide_label">Low:</span>
        <span class="tip_tide_value">Sun 4:29pm (0.20m)</span>
      </div>
    
          <div class="tip_tide">
        <span class="tip_tide_label">High:</span>
        <span class="tip_tide_value">Sun 10:40pm (1.67m)</span>
      </div>
      </div>
</div>

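Edit: Following the answer below, I added the [0] and [1] indexes and switched to find_all, which lets me access the second (and possibly third) instance of the "tip_train" div, i.e. the primary and secondary swell.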

import logging
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.swellnet.com/reports/australia/new-south-wales/northern-beaches/forecast"
res = requests.get(url)
res.raise_for_status()  # must be called; the bare attribute name checks nothing
soup = BeautifulSoup(res.content, 'html.parser')
forecast = soup.find_all("div", {"class": "forecast_tip"})  # one forecast_tip div per forecast slot; outputs 9 items (3 days x 6am, 12pm, 6pm)

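def getData(html, attribute, _class, index):
    # For each forecast block, take the text of the index-th matching tag,
    # or "N/A" when the block has fewer matches than index + 1.
    result = []
    for tag in html:
        matches = tag.find_all(attribute, {"class": _class})
        if index < len(matches):
            result.append(matches[index].getText())
        else:
            result.append("N/A")
    return result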

date = getData(forecast, "div", "tip_date_time", 0)
train1 = getData(forecast, "div", "tip_train", 0)
train2 = getData(forecast, "div", "tip_train", 1)
wave = getData(forecast, "span", "tip_wave", 0)

logging.debug(date)
logging.debug(train1)
logging.debug(train2)
logging.debug(wave)

forecast_data = list(zip(date, train1, train2, wave))
headers = ["Date", "Primary Swell", "Secondary Swell", "Wave Height"]

print(tabulate([*forecast_data], headers=headers))

The result:

Date             Primary Swell            Secondary Swell          Wave Height
---------------  -----------------------  -----------------------  -------------
6am Wed 23 Sep   0.6m @ 8.3s NE (54°)     0.2m @ 13s SSW (195°)    1ft NE
12pm Wed 23 Sep  0.5m @ 8.4s NE (54°)     0.2m @ 12.3s SSW (194°)  1ft NE
6pm Wed 23 Sep   0.4m @ 8.4s NE (56°)     0.2m @ 11.1s SSW (200°)  1ft NE
6am Thu 24 Sep   0.4m @ 10.1s SSW (204°)  0.2m @ 9.9s ENE (77°)    0.5ft SSW
12pm Thu 24 Sep  0.6m @ 10.1s SSW (205°)  0.3m @ 9.8s ENE (73°)    1ft SSW
6pm Thu 24 Sep   0.7m @ 9.9s SSW (203°)   0.2m @ 9.8s ENE (77°)    1ft SSW
6am Fri 25 Sep   0.6m @ 9.1s SSW (197°)   0.2m @ 12.5s SSE (165°)  1ft SSW
12pm Fri 25 Sep  0.3m @ 12.1s S (169°)    0.5m @ 8.9s SSW (192°)   0.5ft S
6pm Fri 25 Sep   0.5m @ 8.8s S (188°)     0.3m @ 11.6s S (169°)    0.5ft S

1 Answer
User
#1 · Posted 2024-04-24 23:32:13

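You don't actually have to work with two instances of tip_train. You can still scrape all the data and, if anything is missing, substitute a placeholder for the missing part and print the data you got.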

Here's one way:

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate


url = "https://www.swellnet.com/reports/australia/new-south-wales/northern-beaches/forecast"
response = requests.get(url)
forecast = BeautifulSoup(response.content, 'html.parser').find_all("div", {"class": "forecast_tip"})


def get_data(html, attribute: str, _class: str) -> list:
    result = []

    for tag in html:
        item = tag.find(attribute, {"class": _class})
        if item is not None:
            result.append(item.getText())
        else:
            result.append("N/A")

    return result


date = get_data(forecast, "div", "tip_date_time")
train = get_data(forecast, "div", "tip_train")
wave = get_data(forecast, "span", "tip_wave")

forecast_data = list(zip(date, train, wave))
headers = ["Date", "Swell Train Data", "Wave Height"]

print(tabulate([*forecast_data], headers=headers))

This prints:

Date             Swell Train Data         Wave Height
---------------  -----------------------  -------------
6am Wed 23 Sep   0.6m @ 8.3s NE (54°)     1ft NE
12pm Wed 23 Sep  0.5m @ 8.4s NE (54°)     1ft NE
6pm Wed 23 Sep   0.4m @ 8.4s NE (56°)     1ft NE
6am Thu 24 Sep   0.4m @ 10.1s SSW (204°)  0.5ft SSW
12pm Thu 24 Sep  0.6m @ 10.1s SSW (205°)  1ft SSW
6pm Thu 24 Sep   0.7m @ 9.9s SSW (203°)   1ft SSW
6am Fri 25 Sep   0.6m @ 9.1s SSW (197°)   1ft SSW
12pm Fri 25 Sep  0.3m @ 12.1s S (169°)    0.5ft S
6pm Fri 25 Sep   0.5m @ 8.8s S (188°)     0.5ft S
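Note that tag.find returns only the first match in each block, so the table above shows just the primary swell. If you do want the first two "tip_train" entries per block, one possible sketch is to cap the search with find_all's limit argument and pad short results. The helper names first_n_texts and block_to_row and the SAMPLE markup are hypothetical, made up for illustration rather than taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: block 1 has three tip_train entries, block 2 has one.
SAMPLE = """
<div class="forecast_tip">
  <div class="tip_date_time">6am Mon 21 Sep</div>
  <div class="tip_train">1.3m @ 7.5s ENE</div>
  <div class="tip_train">0.4m @ 13.1s SSE</div>
  <div class="tip_train">0.3m @ 13.8s SSW</div>
</div>
<div class="forecast_tip">
  <div class="tip_date_time">12pm Mon 21 Sep</div>
  <div class="tip_train">1.2m @ 7.4s ENE</div>
</div>
"""


def first_n_texts(block, name, css_class, n, placeholder="N/A"):
    """Return exactly n strings: the first n matches in block, padded."""
    tags = block.find_all(name, class_=css_class, limit=n)
    texts = [tag.get_text(strip=True) for tag in tags]
    return texts + [placeholder] * (n - len(texts))


def block_to_row(block):
    """Build one (date, primary swell, secondary swell) row from a block."""
    date = first_n_texts(block, "div", "tip_date_time", 1)[0]
    primary, secondary = first_n_texts(block, "div", "tip_train", 2)
    return (date, primary, secondary)


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = [block_to_row(b) for b in soup.find_all("div", class_="forecast_tip")]
for row in rows:
    print(row)
```

Because each row is built from a single forecast_tip block, every row has the same number of fields however many swell trains that block contains, and the rows can be passed straight to tabulate.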
