Python/Beautiful Soup: find a specific heading and output the whole div

0 votes

1 answer

3737 views

Asked 2025-04-20 16:04

I'm trying to parse a very large HTML document that looks roughly like this:

<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>

I need to extract the second div, based on the h2 tag that contains the text "Part 2". I have been able to extract all of the divs with:

divTag = soup.find("div", {"id": "reportsubsection"})

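But I don't know how to narrow it down from there. The approaches I found in other posts can locate the specific text "part 2", but I need to be able to output the entire div that contains it.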

Edit/Update

Sorry, I'm still a bit lost. This is what I have now. I feel like this should be much simpler than I'm making it. Thanks again for all the help!

divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print divTag


1 Answer

Once you have found the right h2 you can always walk back up to its container, or you can test each subsection in turn:

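import re

# The question's HTML marks sections with a class, so select on .reportsubsection rather than an id.
for subsection in soup.select('div.reportsubsection'):
    # string= is the current name for the older text= argument (Beautiful Soup 4.4+)
    if not subsection.find('h2', string=re.compile('part 2')):
        continue
    # do something with this subsection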

This uses a CSS selector to find all of the subsections.
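As a minimal, self-contained sketch of this first approach, the snippet below runs against the sample markup from the question. It assumes the sections are identified by the `reportsubsection` class, as in the posted HTML, rather than by an id. The helper name `find_sections` is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>
"""


def find_sections(soup, heading_pattern):
    """Return every div.reportsubsection containing a matching h2 (hypothetical helper)."""
    pattern = re.compile(heading_pattern, re.IGNORECASE)
    return [
        div
        for div in soup.select("div.reportsubsection")
        if div.find("h2", string=pattern)
    ]


soup = BeautifulSoup(HTML, "html.parser")
sections = find_sections(soup, r"part 2")
for section in sections:
    print(section)  # prints the whole <div>, including its <p> and <table>
```

The `re.IGNORECASE` flag is there so that "Part 2" and "part 2" both match. Note that the pattern is searched for anywhere in the heading text, so it would also match a heading such as "part 20"; tighten the regular expression if that matters for your document.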

Alternatively, you can go back up with the .parent attribute:

for header in soup.find_all('h2', string=re.compile('part 2')):
    section = header.parent  # the enclosing <div>, since the h2 is its direct child
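A short sketch of this second route follows, as one possible variation rather than the only way to do it. Because `.parent` returns only the immediate parent, the sketch uses `find_parent` instead, which keeps working if the `h2` happens to be wrapped in another element. The abbreviated HTML string and the extra `<header>` wrapper are illustrative assumptions, not part of the question.

```python
import re

from bs4 import BeautifulSoup

# Abbreviated, illustrative markup; the second h2 is deliberately nested one level deeper.
HTML = """
<div class="reportsubsection n"><h2> part 1 </h2><p> first </p></div>
<div class="reportsubsection n"><header><h2> part 2 </h2></header><p> second </p></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

matches = []
for header in soup.find_all("h2", string=re.compile(r"part 2")):
    # header.parent would be <header> here; find_parent climbs to the enclosing section div.
    section = header.find_parent("div", class_="reportsubsection")
    if section is not None:
        matches.append(section)

print(matches[0])  # the whole second <div>
```

With the markup from the question, where each `h2` is a direct child of its `div`, `header.parent` and `header.find_parent("div", class_="reportsubsection")` return the same element.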

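The key is to narrow your search as early as possible: the second approach has to find every h2 element in the entire document, whereas the first approach narrows the search sooner.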

Answered 2025-04-20 by Python大师
