Extracting the innermost nested links
For this example HTML page:
<div class="ia-secondary-content">
<div class="plugin_pagetree conf-macro output-inline" data-hasbody="false" data-macro-name="pagetree">
<div class="plugin_pagetree_children_list plugin_pagetree_children_list_noleftspace">
<div class="plugin_pagetree_children" id="children1326817570-0">
<ul class="plugin_pagetree_children_list" id="child_ul1326817570-0">
<li>
<div class="plugin_pagetree_childtoggle_container">
<a aria-expanded="false" aria-label="Expand item Topic 1" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="1630374642" data-tree-id="0" data-type="toggle" href="" id="plusminus1630374642-0"></a>
</div>
<div class="plugin_pagetree_children_content">
<span class="plugin_pagetree_children_span" id="childrenspan1630374642-0"> <a href="#">Topic 1</a></span>
</div>
<div class="plugin_pagetree_children_container" id="children1630374642-0"></div>
</li>
<li>
<div class="plugin_pagetree_childtoggle_container">
<a aria-expanded="false" aria-label="Expand item Topic 2" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="1565544568" data-tree-id="0" data-type="toggle" href="" id="plusminus1565544568-0"></a>
</div>
<div class="plugin_pagetree_children_content">
<span class="plugin_pagetree_children_span" id="childrenspan1565544568-0"> <a href="#">Topic 2</a></span>
</div>
<div class="plugin_pagetree_children_container" id="children1565544568-0"></div>
</li>
<li>
<div class="plugin_pagetree_childtoggle_container">
<a aria-expanded="true" aria-label="Expand item Topic 3" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-down" data-children-loaded="true" data-expanded="true" data-page-id="3733362288" data-tree-id="0" data-type="toggle"
href="" id="plusminus3733362288-0"></a>
</div>
<div class="plugin_pagetree_children_content">
<span class="plugin_pagetree_children_span" id="childrenspan3733362288-0"> <a href="#">Topic 3</a></span>
</div>
<div class="plugin_pagetree_children_container" id="children3733362288-0">
<ul class="plugin_pagetree_children_list" id="child_ul3733362288-0">
<li>
<div class="plugin_pagetree_childtoggle_container">
<span class="no-children icon"></span>
</div>
<div class="plugin_pagetree_children_content">
<span class="plugin_pagetree_children_span"> <a href="#">Subtopic 1</a></span>
</div>
<div class="plugin_pagetree_children_container"></div>
</li>
<li>
<div class="plugin_pagetree_childtoggle_container">
<span class="no-children icon"></span>
</div>
<div class="plugin_pagetree_children_content">
<span class="plugin_pagetree_children_span"> <a href="#">Subtopic 2</a></span>
</div>
<div class="plugin_pagetree_children_container"></div>
</li>
</ul>
</div>
</li>
<li>
<div class="plugin_pagetree_childtoggle_container">
<a aria-expanded="false" aria-label="Expand item Topic 4" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="2238798992" data-tree-id="0" data-type="toggle" href="" id="plusminus2238798992-0"></a>
</div>
<div class="plugin_pagetree_children_content">
<span class="plugin_pagetree_children_span" id="childrenspan2238798992-0"> <a href="#">Topic 4</a></span>
</div>
<div class="plugin_pagetree_children_container" id="children2238798992-0"></div>
</li>
</ul>
</div>
</div>
<fieldset class="hidden">
</fieldset>
</div>
</div>
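I need to extract the innermost links from this kind of page structure. Given a title, I want to find all the links under it. How can I find all of the innermost links? I want to write a Python script that can dynamically extract the innermost links from different HTML pages. Note that the nesting depth may vary.
So for this example, I should get: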
<a href="#">Subtopic 1</a>
<a href="#">Subtopic 2</a>
I tried extracting all the links while keeping the same nesting structure, but it did not work:
# Step 1: Find the div with the given title
title = "Topic 3"
target_div = soup.find('span', class_='plugin_pagetree_children_span', text=title)
# Step 2: Extract the next div with class "plugin_pagetree_children_container"
if target_div:
    container_div = target_div.find_next_sibling('div', class_='plugin_pagetree_children_container')
    # Step 3: Extract all links within the container and print them
    if container_div:
        links = container_div.find_all('a')
        for link in links:
            print(link['href'])
1 Answer
If I understand you correctly, you can do this:
from bs4 import BeautifulSoup
# html_text = ... # your html code from the question
soup = BeautifulSoup(html_text, "html.parser")
for a in soup.select("li li a"):
    print(a)
Output:
<a href="#">Subtopic 1</a>
<a href="#">Subtopic 2</a>
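One caveat with the "li li a" selector: it matches every <a> inside a second-level <li>, so if a nested item had children of its own, its expand/collapse toggle link would presumably be matched as well. A minimal sketch that narrows the match to the title links, assuming they always sit directly inside a span with the class plugin_pagetree_children_span as in the sample page (the function name is made up for illustration):

```python
from bs4 import BeautifulSoup


def second_level_title_links(html_text):
    """Return title links inside second-level <li> items, skipping toggle links."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Only accept <a> tags that are direct children of the title span.
    return soup.select("li li span.plugin_pagetree_children_span > a")
```

This still assumes exactly two levels of nesting; it only tightens which links count.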
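Edit: if the nesting depth is not known in advance, you can keep prepending "li" to the selector until it no longer matches anything; the last non-empty result is the deepest level: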
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, "html.parser")
result, tags = None, ["li", "a"]
while True:
    a = soup.select(" ".join(tags))
    if not a:
        break
    else:
        tags.insert(0, "li")
        result = a
print(result)
Output:
[<a href="#">Subtopic 1</a>, <a href="#">Subtopic 2</a>]
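The snippets above search the whole page, while the question asks for the links under a given title. Here is a rough sketch of one way to combine the two. It is not the same technique as the loop above: instead of looking for the maximum depth, it treats a link as "innermost" when its own <li> contains no further <li>. The function name innermost_links is hypothetical, and the code assumes every title link is a direct child of a span with the class plugin_pagetree_children_span, as in the sample page.

```python
from bs4 import BeautifulSoup

# Assumption: title links always look like <span class="plugin_pagetree_children_span"> <a>...</a></span>
TITLE_LINK = "span.plugin_pagetree_children_span > a"


def innermost_links(html_text, title):
    """Return the leaf-level links nested under the item whose link text equals `title`."""
    soup = BeautifulSoup(html_text, "html.parser")

    # Compare stripped text, because the span holds a leading space before the <a>.
    heading = next(
        (a for a in soup.select(TITLE_LINK) if a.get_text(strip=True) == title),
        None,
    )
    if heading is None:
        return []

    item = heading.find_parent("li")
    if item is None:
        return []

    # A link is a leaf when the <li> that holds it has no nested <li>.
    # The heading link itself is never returned, even if it has no children.
    return [
        a
        for a in item.select(TITLE_LINK)
        if a is not heading and a.find_parent("li").find("li") is None
    ]
```

With the HTML from the question in html_text, innermost_links(html_text, "Topic 3") should give the two Subtopic links, and a title without expanded children gives an empty list. If branches under one title have different depths, this returns the leaves of every branch rather than only the deepest ones, which may or may not be what you want.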