BeautifulSoup: how to get a nested div

9 votes

1 answer

23418 views

Asked 2025-04-30 08:46

Given the following code:

<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>

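How can I use BeautifulSoup to extract the word test from <div class="category5"> test? In other words, how do I handle nested divs? I searched online but could not find a simple, easy-to-follow example, so I put this one together myself. Thanks.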


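1 Answer

XPath would be a simple solution here, but BeautifulSoup does not support it.

Update: a BeautifulSoup solution

If you know the class name and the element you are looking for (a div in this case), you can use a for loop with attrs to get what you want: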

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

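# Name the parser explicitly; html.parser ships with Python
content = BeautifulSoup(html, 'html.parser')

# find_all is the bs4 spelling of the older findAll
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.get_text(strip=True))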

test
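
Since the question is specifically about nesting, here is a minimal sketch of two ways to follow the nesting explicitly rather than matching on the innermost class alone. It assumes bs4 is installed and uses the built-in html.parser; the variable names (by_selector, by_find, nested_text) are illustrative only and not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: a CSS selector that spells out the nesting path
by_selector = soup.select_one("#foo #bar div.category4 > div.category5")

# Option 2: chained find() calls, each one searching inside the previous result
outer = soup.find("div", id="foo")
by_find = outer.find("div", class_="category4").find("div", class_="category5")

# get_text(strip=True) drops the surrounding whitespace and newline
nested_text = by_selector.get_text(strip=True)
print(nested_text)
print(by_find.get_text(strip=True))
```

Both options reach the same element here. The selector form is shorter, while the chained form lets you check each step for None when an intermediate div might be missing.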

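I had no trouble extracting the text from your HTML sample. As @MartijnPieters suggested, you need to find out why your div element is missing.

Another update

Because you do not have the lxml parser installed, BeautifulSoup returned None: nothing was actually parsed. Installing lxml should fix your problem.

You might also consider lxml or a similar tool that supports XPath, which, if you ask me, is very simple to use.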
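
As a hedged illustration of the parser issue, the sketch below asks for lxml and falls back to the built-in html.parser when lxml is not installed. make_soup is a hypothetical helper written for this example, not something from the question or from bs4 itself; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<div class="category5"> test</div>')

# find() returns None when nothing matches, so guard before reading text
div = soup.find("div", class_="category5")
text = div.get_text(strip=True) if div is not None else None
print(text)
```

Checking for None before calling get_text avoids the AttributeError you would otherwise get when the element is not in the parsed tree.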

from lxml import etree

tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n                 ']
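
The XPath result above still carries the trailing newline and indentation. A small sketch of two ways to clean it up, assuming lxml is installed and that the markup is well-formed enough for etree.fromstring (for messier real-world pages, lxml.html.fromstring is more forgiving); the names cleaned and normalized are illustrative.

```python
from lxml import etree

html = """<html>
<body>
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>
</body>
</html>"""

tree = etree.fromstring(html)

# Strip whitespace from each text node in Python
raw = tree.xpath('.//div[@class="category5"]/text()')
cleaned = [t.strip() for t in raw]

# Or let XPath do it: normalize-space() returns one trimmed string
normalized = tree.xpath('normalize-space(.//div[@class="category5"])')

print(cleaned)
print(normalized)
```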

Answered 2025-04-30 by Python大师
