Extract the innermost nested links

1 vote
1 answer
27 views
Asked 2025-04-14 15:28


For this sample HTML page:

<div class="ia-secondary-content">
  <div class="plugin_pagetree conf-macro output-inline" data-hasbody="false" data-macro-name="pagetree">
    <div class="plugin_pagetree_children_list plugin_pagetree_children_list_noleftspace">
      <div class="plugin_pagetree_children" id="children1326817570-0">
        <ul class="plugin_pagetree_children_list" id="child_ul1326817570-0">
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="false" aria-label="Expand item Topic 1" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="1630374642" data-tree-id="0" data-type="toggle" href="" id="plusminus1630374642-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan1630374642-0"> <a href="#">Topic 1</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children1630374642-0"></div>
          </li>
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="false" aria-label="Expand item Topic 2" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="1565544568" data-tree-id="0" data-type="toggle" href="" id="plusminus1565544568-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan1565544568-0"> <a href="#">Topic 2</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children1565544568-0"></div>
          </li>
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="true" aria-label="Expand item Topic 3" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-down" data-children-loaded="true" data-expanded="true" data-page-id="3733362288" data-tree-id="0" data-type="toggle"
                href="" id="plusminus3733362288-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan3733362288-0"> <a href="#">Topic 3</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children3733362288-0">
              <ul class="plugin_pagetree_children_list" id="child_ul3733362288-0">
                <li>
                  <div class="plugin_pagetree_childtoggle_container">
                    <span class="no-children icon"></span>
                  </div>
                  <div class="plugin_pagetree_children_content">
                    <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 1</a></span>
                  </div>
                  <div class="plugin_pagetree_children_container"></div>
                </li>
                <li>
                  <div class="plugin_pagetree_childtoggle_container">
                    <span class="no-children icon"></span>
                  </div>
                  <div class="plugin_pagetree_children_content">
                    <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 2</a></span>
                  </div>
                  <div class="plugin_pagetree_children_container"></div>
                </li>
              </ul>
            </div>
          </li>
          <li>
            <div class="plugin_pagetree_childtoggle_container">
              <a aria-expanded="false" aria-label="Expand item Topic 4" class="plugin_pagetree_childtoggle aui-icon aui-icon-small aui-iconfont-chevron-right" data-page-id="2238798992" data-tree-id="0" data-type="toggle" href="" id="plusminus2238798992-0"></a>
            </div>
            <div class="plugin_pagetree_children_content">
              <span class="plugin_pagetree_children_span" id="childrenspan2238798992-0"> <a href="#">Topic 4</a></span>
            </div>
            <div class="plugin_pagetree_children_container" id="children2238798992-0"></div>
          </li>
        </ul>
      </div>
    </div>
    <fieldset class="hidden">
    </fieldset>
  </div>
</div>

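I need to extract the innermost links from this kind of page structure. Given a title, I want to find all of the links under that title. How can I find all of the innermost links? I want to write a Python script that can extract the innermost links dynamically from different HTML pages. Note that the nesting depth may vary from page to page.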

So for this sample, I should get:

<a href="#">Subtopic 1</a>
<a href="#">Subtopic 2</a>

I tried to extract all of the links while keeping the same nested structure, but it did not work:

# Step 1: Find the div with the given title
title = "Topic 3"
target_div = soup.find('span', class_='plugin_pagetree_children_span', text=title)

# Step 2: Extract the next div with class "plugin_pagetree_children_container"
if target_div:
    container_div = target_div.find_next_sibling('div', class_='plugin_pagetree_children_container')

    # Step 3: Extract all links within the container and print them
    if container_div:
        links = container_div.find_all('a')
        for link in links:
            print(link['href'])
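
A likely reason the attempt above prints nothing: the title text sits inside the `<a>` child of the span (after a leading space), so a text filter on the span itself does not match, and the container `div` is a sibling of the enclosing `plugin_pagetree_children_content` div rather than of the span. A minimal sketch of the same three steps with those two points adjusted follows. The trimmed `html_text` and the helper name `links_under_title` are assumptions made for illustration, not part of the original page or script.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page tree in the question (assumption: same class names).
html_text = """
<ul>
  <li>
    <div class="plugin_pagetree_children_content">
      <span class="plugin_pagetree_children_span"> <a href="#">Topic 3</a></span>
    </div>
    <div class="plugin_pagetree_children_container">
      <ul>
        <li>
          <div class="plugin_pagetree_children_content">
            <span class="plugin_pagetree_children_span"> <a href="#">Subtopic 1</a></span>
          </div>
          <div class="plugin_pagetree_children_container"></div>
        </li>
      </ul>
    </div>
  </li>
</ul>
"""


def links_under_title(soup, title):
    """Return every <a> inside the container that follows the given title (hypothetical helper)."""
    # Step 1: compare the span's stripped text, since the title lives in a child <a>
    for span in soup.find_all("span", class_="plugin_pagetree_children_span"):
        if span.get_text(strip=True) != title:
            continue
        # Step 2: the container is a sibling of the span's parent div, not of the span
        content_div = span.find_parent("div", class_="plugin_pagetree_children_content")
        if content_div is None:
            return []
        container = content_div.find_next_sibling(
            "div", class_="plugin_pagetree_children_container"
        )
        # Step 3: collect all links within the container
        return container.find_all("a") if container else []
    return []


soup = BeautifulSoup(html_text, "html.parser")
links = links_under_title(soup, "Topic 3")
print([a.get_text() for a in links])
```

This returns every link below the title at any depth, so it still needs a depth filter when only the innermost links are wanted.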

1 Answer

0

If I understand you correctly, you can do this:

from bs4 import BeautifulSoup

# html_text = ... # your html code from the question

soup = BeautifulSoup(html_text, "html.parser")

for a in soup.select("li li a"):
    print(a)

Output:

<a href="#">Subtopic 1</a>
<a href="#">Subtopic 2</a>

Edit: if the nesting depth varies, keep extending the selector until it stops matching:

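from bs4 import BeautifulSoup

# html_text = ... # your html code from the question

soup = BeautifulSoup(html_text, "html.parser")

# Keep prepending "li" to the selector until nothing matches;
# the last non-empty match holds the most deeply nested links.
result, tags = None, ["li", "a"]
while True:
    links = soup.select(" ".join(tags))
    if not links:
        break
    tags.insert(0, "li")
    result = links

print(result)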

Output:

[<a href="#">Subtopic 1</a>, <a href="#">Subtopic 2</a>]
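
The loop above looks for the deepest level across the whole document and does not use the title. If the result has to be limited to one title, a sketch along the following lines may help. It assumes "innermost" means the links at the greatest `li` depth below the titled item, skips links with no visible text (such as the toggle anchors in the question's markup), and uses a simplified `html_text` plus a hypothetical helper named `deepest_links`.

```python
from bs4 import BeautifulSoup

# Simplified tree (assumption): "Topic 3" nests two levels, "Topic 4" nests three.
html_text = """
<ul>
  <li><span> <a href="#">Topic 3</a></span>
    <ul>
      <li><span> <a href="#">Subtopic 1</a></span></li>
      <li><span> <a href="#">Subtopic 2</a></span></li>
    </ul>
  </li>
  <li><span> <a href="#">Topic 4</a></span>
    <ul>
      <li><span> <a href="#">Sub A</a></span>
        <ul>
          <li><span> <a href="#">Deep 1</a></span></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
"""


def deepest_links(soup, title):
    """Return the most deeply nested links below the item with this title (hypothetical helper)."""
    # Find the link whose visible text equals the title
    heading = next(
        (a for a in soup.find_all("a") if a.get_text(strip=True) == title), None
    )
    if heading is None:
        return []
    item = heading.find_parent("li")
    if item is None:
        return []

    # Links that belong to nested items, ignoring the title link and empty anchors
    nested = [
        a for a in item.find_all("a")
        if a.find_parent("li") is not item and a.get_text(strip=True)
    ]
    if not nested:
        return []

    # Depth of a link = number of <li> ancestors; keep only the deepest ones
    depths = [len(a.find_parents("li")) for a in nested]
    max_depth = max(depths)
    return [a for a, d in zip(nested, depths) if d == max_depth]


soup = BeautifulSoup(html_text, "html.parser")
print(deepest_links(soup, "Topic 3"))
print(deepest_links(soup, "Topic 4"))
```

Counting `li` ancestors avoids rebuilding the selector string for each level, and scoping the search to the titled `li` keeps deeper branches elsewhere in the tree from hiding the result.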
