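I'm trying to scrape HTML with bs4 in Python. The page repeats the same tags, and those tags contain the data I want. The data I want to collect is in class="tip_date_time", class="tip_wave" and class="tip_train".
So far I have done the following in Python: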
soup = BeautifulSoup(res.content, 'html.parser')
html = soup.find_all("div", {"class": "forecast_tip"})
dateCond = []
for date in html:
    for text in date.find_all("div", {"class": "tip_date_time"}):
        dateCond.append(text.getText())
waveCond = []
for wave in html:
    for text in wave.find_all("span", {"class": "tip_wave"}):
        waveCond.append(text.getText())
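This creates a separate list for each scraped field, which I intend to line up by index, so dateCond[0] pairs with waveCond[0]. This works because each list has the same number of items.
However, I ran into a problem with "tip_train", because it can vary from 1 to 3 entries depending on the date. If I use the same code, I may end up with a list whose length differs from the others, which breaks the index alignment.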
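To illustrate the mismatch with made-up values (the lists below are hypothetical, not scraped output), pairing a flat "tip_train" list with the dates by index attaches the wrong swell to the second date:

```python
# Hypothetical values: two forecast blocks, the first with three
# tip_train entries and the second with only one.
date_cond = ["6am Mon 21 Sep", "12pm Mon 21 Sep"]
wave_cond = ["2ft ENE", "2ft ENE"]
train_cond = [
    "1.3m @ 7.5s ENE (64°)",    # block 1
    "0.4m @ 13.1s SSE (167°)",  # block 1
    "0.3m @ 13.8s SSW (194°)",  # block 1
    "1.2m @ 7.4s ENE (65°)",    # block 2
]

# Lists of equal length pair up correctly by index.
aligned = list(zip(date_cond, wave_cond))

# The flat train list has 4 items for 2 dates: zip() stops at the shorter
# input, and the second date is paired with a train from the first block.
misaligned = list(zip(date_cond, train_cond))
```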
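So I would like to select only the first 2 instances of "tip_train" within each block that contains a "tip_date_time" div. I can't simply take the first 2 instances scraped overall, because I want the first 2 instances for each day.
The HTML is as follows: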
<div class="forecast_tip">
  <div class="tip_date_time">6am Mon 21 Sep</div>
  <div class="tip_surf">
    <span class="tip_wave">2ft ENE</span>
    <span class="tip_wind">7kt NNW</span>
  </div>
  <div class="tip_description">(Waist-Shoulder High)</div>
  <div class="tip_train">1.3m @ 7.5s ENE (64°)</div>
  <div class="tip_train">0.4m @ 13.1s SSE (167°)</div>
  <div class="tip_train">0.3m @ 13.8s SSW (194°)</div>
  <div class="tip_tides">
    <div class="tip_tide">
      <span class="tip_tide_label">Low:</span>
      <span class="tip_tide_value">Sun 4:29pm (0.20m)</span>
    </div>
    <div class="tip_tide">
      <span class="tip_tide_label">High:</span>
      <span class="tip_tide_value">Sun 10:40pm (1.67m)</span>
    </div>
  </div>
</div>
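For reference, one possible way to take at most the first two "tip_train" entries per block is to run the search inside each "forecast_tip" div and pass limit=2 to find_all. This is only a sketch: the second block in SAMPLE_HTML is a hypothetical addition with a single entry, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup, plus a hypothetical second block
# that has only one tip_train entry.
SAMPLE_HTML = """
<div class="forecast_tip">
  <div class="tip_date_time">6am Mon 21 Sep</div>
  <div class="tip_train">1.3m @ 7.5s ENE (64°)</div>
  <div class="tip_train">0.4m @ 13.1s SSE (167°)</div>
  <div class="tip_train">0.3m @ 13.8s SSW (194°)</div>
</div>
<div class="forecast_tip">
  <div class="tip_date_time">12pm Mon 21 Sep</div>
  <div class="tip_train">1.2m @ 7.4s ENE (65°)</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for block in soup.find_all("div", class_="forecast_tip"):
    date = block.find("div", class_="tip_date_time").get_text(strip=True)
    # limit=2 stops after two matches, and the search is scoped to this
    # block, so entries from other blocks can never leak in.
    trains = [
        tag.get_text(strip=True)
        for tag in block.find_all("div", class_="tip_train", limit=2)
    ]
    rows.append((date, trains))
```

Keeping the date and its trains together in one row per block avoids relying on separate lists staying the same length.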
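Edit: following the responses below, I added the [0] and [1] indexes and switched to find_all, which lets me access the second (and possibly third) instance of the "div" "tip_train", i.e. the primary and secondary swell.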
import logging

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.swellnet.com/reports/australia/new-south-wales/northern-beaches/forecast"
res = requests.get(url)
res.raise_for_status()  # must be called; without the parentheses the check never runs
soup = BeautifulSoup(res.content, 'html.parser')
# Scrapes every div with class forecast_tip, one per forecast block.
# Will output 9 items (3 days x 6am, 12pm, 6pm).
forecast = soup.find_all("div", {"class": "forecast_tip"})

def getData(html, attribute, _class, index):
    """Return the text of the index-th matching tag in each block, or "N/A" if absent."""
    result = []
    for tag in html:
        matches = tag.find_all(attribute, {"class": _class})
        if index < len(matches):
            result.append(matches[index].get_text())
        else:
            # Fewer matches than requested: pad so every list keeps the same length.
            result.append("N/A")
    return result

date = getData(forecast, "div", "tip_date_time", 0)
train1 = getData(forecast, "div", "tip_train", 0)
train2 = getData(forecast, "div", "tip_train", 1)
wave = getData(forecast, "span", "tip_wave", 0)
logging.debug(date)
logging.debug(train1)
logging.debug(train2)
logging.debug(wave)
forecast_data = list(zip(date, train1, train2, wave))
headers = ["Date", "Primary Swell", "Secondary Swell", "Wave Height"]
print(tabulate(forecast_data, headers=headers))
The result is as follows:
Date             Primary Swell            Secondary Swell          Wave Height
---------------  -----------------------  -----------------------  -------------
6am Wed 23 Sep   0.6m @ 8.3s NE (54°)     0.2m @ 13s SSW (195°)    1ft NE
12pm Wed 23 Sep  0.5m @ 8.4s NE (54°)     0.2m @ 12.3s SSW (194°)  1ft NE
6pm Wed 23 Sep   0.4m @ 8.4s NE (56°)     0.2m @ 11.1s SSW (200°)  1ft NE
6am Thu 24 Sep   0.4m @ 10.1s SSW (204°)  0.2m @ 9.9s ENE (77°)    0.5ft SSW
12pm Thu 24 Sep  0.6m @ 10.1s SSW (205°)  0.3m @ 9.8s ENE (73°)    1ft SSW
6pm Thu 24 Sep   0.7m @ 9.9s SSW (203°)   0.2m @ 9.8s ENE (77°)    1ft SSW
6am Fri 25 Sep   0.6m @ 9.1s SSW (197°)   0.2m @ 12.5s SSE (165°)  1ft SSW
12pm Fri 25 Sep  0.3m @ 12.1s S (169°)    0.5m @ 8.9s SSW (192°)   0.5ft S
6pm Fri 25 Sep   0.5m @ 8.8s S (188°)     0.3m @ 11.6s S (169°)    0.5ft S
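You don't actually have to limit yourself to two tip_train instances. You can still scrape all of the data and, if anything is missing, fill in the missing part and print whatever you got. Here's one way: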
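A minimal sketch of that idea, not the answer's original code: it collects every tip_train in each forecast_tip block and pads the shorter blocks with a placeholder so all rows end up the same width. The sample markup (apart from its first block), the scrape_rows helper and the "N/A" placeholder are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two blocks with 3 and 1 tip_train entries.
SAMPLE_HTML = """
<div class="forecast_tip">
  <div class="tip_date_time">6am Mon 21 Sep</div>
  <div class="tip_train">1.3m @ 7.5s ENE (64°)</div>
  <div class="tip_train">0.4m @ 13.1s SSE (167°)</div>
  <div class="tip_train">0.3m @ 13.8s SSW (194°)</div>
</div>
<div class="forecast_tip">
  <div class="tip_date_time">12pm Mon 21 Sep</div>
  <div class="tip_train">1.2m @ 7.4s ENE (65°)</div>
</div>
"""

def scrape_rows(html, placeholder="N/A"):
    """Return one row per forecast block: [date, train1, train2, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    raw = []
    for block in soup.find_all("div", class_="forecast_tip"):
        date = block.find("div", class_="tip_date_time").get_text(strip=True)
        trains = [
            tag.get_text(strip=True)
            for tag in block.find_all("div", class_="tip_train")
        ]
        raw.append((date, trains))
    # Pad every row up to the largest number of trains seen in any block.
    width = max((len(trains) for _, trains in raw), default=0)
    return [
        [date] + trains + [placeholder] * (width - len(trains))
        for date, trains in raw
    ]

rows = scrape_rows(SAMPLE_HTML)
for row in rows:
    print(" | ".join(row))
```

Because each row is built from a single block, a block with fewer swell trains gets padded instead of shifting the values of later blocks.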