Extracting different elements with BeautifulSoup: avoiding duplicates from nested elements

2 votes

1 answer

619 views

Asked 2025-04-18 03:25

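I want to use BeautifulSoup4 to extract several kinds of content (classes) from a locally saved website (the Python documentation), so I wrote the code below. Here index.html is a saved copy of https://docs.python.org/3/library/stdtypes.html.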

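from bs4 import BeautifulSoup

# Parse the locally saved page; naming the parser explicitly avoids a bs4 warning.
with open("index.html", encoding="utf-8") as page:
    soup = BeautifulSoup(page, "html.parser")

classes = soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']})

# Mode 'w' already truncates the file, so no explicit truncate() call is needed.
with open('test.html', 'w', encoding="utf-8") as f:
    print(classes, file=f)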

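The file handle is only there to write out the result; it has no bearing on the problem itself.

My problem is that the results are nested. For example, the method "__eq__" is found twice: (1) inside its class, and (2) again as a stand-alone method.

So I would like to drop every result that is nested inside another result, so that all results sit at the same level. How can I do that? Or is it possible to "ignore" this nested content in the first step? I hope my meaning is clear.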

Tags: data extraction, file handling, HTML parsing, classes and methods, beautifulsoup, web parsing, nested elements, result deduplication

1 Answer

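You can't tell find to ignore nested dl elements; what you can do is skip any match that appears in the .descendants of an element you have already kept: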

matches = []
for dl in soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']}):
    if any(dl in m.descendants for m in matches):
        # dl is nested inside an element we already kept, so skip it
        continue
    matches.append(dl)
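
As a small runnable illustration of the same idea, the sketch below keeps only the outermost matches on a made-up HTML fragment. The fragment, the `KINDS` list name, and the use of `find_parent` (instead of scanning `.descendants`) are assumptions made for this example; `find_parent` simply asks whether a matching `dl` ancestor exists.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs page: a class that contains one method,
# followed by a stand-alone function.
HTML = """
<dl class="class"><dt>Foo</dt>
  <dd><dl class="method"><dt>__eq__</dt><dd>compare</dd></dl></dd>
</dl>
<dl class="function"><dt>bar</dt><dd>do something</dd></dl>
"""
KINDS = ['class', 'method', 'function', 'describe',
         'attribute', 'data', 'classmethod', 'staticmethod']

soup = BeautifulSoup(HTML, "html.parser")

# Keep a match only if no ancestor is itself a matching <dl>.
top_level = [
    dl for dl in soup.find_all('dl', attrs={'class': KINDS})
    if dl.find_parent('dl', attrs={'class': KINDS}) is None
]

names = [dl.dt.get_text() for dl in top_level]
print(names)  # ['Foo', 'bar']
```

The nested `__eq__` entry is still reachable through the `Foo` element; it just no longer appears as a separate top-level result.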

If you want the nested elements rather than their parents, use:

matches = []
for dl in soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']}):
    # drop any previously kept element that contains dl, then keep dl itself
    matches = [m for m in matches if dl not in m.descendants]
    matches.append(dl)
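
The innermost-only variant can be sketched in the same way. Again the HTML fragment and the `KINDS` name are invented for illustration, and the check is written with `find` (does this element contain another matching `dl`?) rather than the `.descendants` scan above; on a tree like this one it should select the same elements.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs page: a class that contains one method,
# followed by a stand-alone function.
HTML = """
<dl class="class"><dt>Foo</dt>
  <dd><dl class="method"><dt>__eq__</dt><dd>compare</dd></dl></dd>
</dl>
<dl class="function"><dt>bar</dt><dd>do something</dd></dl>
"""
KINDS = ['class', 'method', 'function', 'describe',
         'attribute', 'data', 'classmethod', 'staticmethod']

soup = BeautifulSoup(HTML, "html.parser")

# Keep a match only if it does not contain another matching <dl>.
innermost = [
    dl for dl in soup.find_all('dl', attrs={'class': KINDS})
    if dl.find('dl', attrs={'class': KINDS}) is None
]

names = [dl.dt.get_text() for dl in innermost]
print(names)  # ['__eq__', 'bar']
```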

If you want to take the tree apart and remove the elements from it, use:

matches = soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']})
for element in matches:
    element.extract()  # detach from the tree (and from any parent `dl` match)

But in that case you may want to adjust how you extract the text.
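
To make the effect of the `extract()` approach concrete, here is a minimal sketch on an invented fragment (the HTML and the `KINDS` name are assumptions for the example). After the loop every match is detached, so the text of a parent match no longer includes its nested entries, which is why the text extraction may need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the docs page: a class that contains one method,
# followed by a stand-alone function.
HTML = """
<dl class="class"><dt>Foo</dt>
  <dd><dl class="method"><dt>__eq__</dt><dd>compare</dd></dl></dd>
</dl>
<dl class="function"><dt>bar</dt><dd>do something</dd></dl>
"""
KINDS = ['class', 'method', 'function', 'describe',
         'attribute', 'data', 'classmethod', 'staticmethod']

soup = BeautifulSoup(HTML, "html.parser")

# find_all returns a list up front, so modifying the tree in the loop is safe.
matches = soup.find_all('dl', attrs={'class': KINDS})
for element in matches:
    element.extract()  # detach from whatever tree currently holds it

titles = [m.dt.get_text() for m in matches]
outer_text = matches[0].get_text()

print(titles)                  # ['Foo', '__eq__', 'bar']
print('__eq__' in outer_text)  # False: the method was pulled out of its class
print(soup.find('dl'))         # None: nothing is left in the original tree
```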

Answered 2025-04-18 by Python大师
