Splitting a BeautifulSoup tree into two soup trees

1 vote

1 answer

1,650 views

Asked 2025-04-17 17:52

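There are several ways to split a BeautifulSoup parse tree into a list of elements, or to get a tag's string. But there seems to be no way to split it while keeping the tree intact.

I want to split this snippet (soup) on <br />. Splitting the string is easy, but I want to keep the structure and get a list of parse trees.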

from bs4 import BeautifulSoup

s = """<p>
foo<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar
</p>"""
# No parser is named here; bs4 picks the best one installed.
soup = BeautifulSoup(s)

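Of course I could do [BeautifulSoup(i) for i in str(soup).split('<br />')], but that is ugly, and I have too many links to handle them that way.

Iterating with soup.next and soup.previousSibling() over soup.findAll('br') is possible, but it does not return parse trees, only all the elements they contain.

Is there a way to extract a complete subtree from a BeautifulSoup tag while preserving all the parent and sibling relationships?

To make it clearer:

The result should be a list of BeautifulSoup objects, so that I can keep traversing the split soup with output[0].a, output[1].text and so on. Splitting on <br /> would return a list with one entry per link for further processing, which is exactly what I need. Everything in the snippet above that belongs to a link, including its text, its attributes and the "bar" that follows it, is the description of that link.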


1 Answer

If you don't mind the original tree being modified, I would call .extract() on the <br /> tags, which simply removes them from the tree:

>>> for br in soup.find_all('br'): br.extract()
... 
<br/>
<br/>
<br/>
<br/>
>>> soup
<html><body><p>
foo
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
</p></body></html>

This is still a complete, working tree:

>>> soup.p
<p>
foo
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
</p>
>>> soup.p.a
<a href="http://...html" target="_blank">foo</a>
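
As a small check of what .extract() does, here is a minimal sketch. The markup and the example.com URLs are placeholders I made up for illustration (the links in the question are elided), and "html.parser" is an assumption; any installed parser should behave the same way for this snippet.

```python
from bs4 import BeautifulSoup

# Placeholder markup with made-up URLs, shaped like the snippet in the question.
html = (
    '<p>foo<br/>'
    '<a href="http://example.com/a.html">first</a> | bar<br/>'
    '<a href="http://example.com/b.html">second</a> | baz</p>'
)
soup = BeautifulSoup(html, "html.parser")

# extract() removes the tag from the tree and returns the removed tag.
removed = [br.extract() for br in soup.find_all("br")]

# Each extracted <br> is detached, so it no longer has a parent.
detached = all(br.parent is None for br in removed)

# The remaining tree still navigates as before.
first_href = soup.p.a["href"]
link_count = len(soup.p.find_all("a"))
remaining_br = len(soup.find_all("br"))

print(detached, first_href, link_count, remaining_br)
```

The extracted tags are still usable objects, so they could be kept or re-inserted elsewhere if needed.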

However, you don't actually need to remove those tags at all to get what you are after:

for link in soup.find_all('a'):
    # href, the link text, and the text node that follows the link
    print(link['href'], ''.join(link.stripped_strings), link.next_sibling)

This prints:

>>> for link in soup.find_all('a'):
...     print(link['href'], ''.join(link.stripped_strings), link.next_sibling)
... 
http://...html foo  | bar
http://...html foo  | bar
http://...html foo  | bar
http://...html foo  | bar

The result is the same whether or not we remove the <br /> tags from the tree first.
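
If a list of separate trees is still wanted, as the question describes, one possible approach is to walk the children of the <p> tag and start a new, empty BeautifulSoup object at every <br />. The sketch below is not a built-in feature of the library: split_on_br is a hypothetical helper name, the example.com URLs are placeholders, and "html.parser" is an assumption. It relies on copy.copy(), which the bs4 documentation describes for Tag and NavigableString objects, so the original tree is left untouched.

```python
import copy

from bs4 import BeautifulSoup
from bs4.element import Tag


def split_on_br(container):
    """Return a list of new soups, one per run of children between <br> tags.

    Hypothetical helper: the children are copied, so `container` is not modified.
    """
    pieces = [BeautifulSoup("", "html.parser")]
    for child in container.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the current piece and opens the next one.
            pieces.append(BeautifulSoup("", "html.parser"))
        else:
            # Copy the node (tags are copied with their contents) into the piece.
            pieces[-1].append(copy.copy(child))
    return pieces


# Placeholder markup with made-up URLs, shaped like the snippet in the question.
html = (
    '<p>foo<br/>'
    '<a href="http://example.com/a.html">first</a> | bar<br/>'
    '<a href="http://example.com/b.html">second</a> | baz</p>'
)
soup = BeautifulSoup(html, "html.parser")
output = split_on_br(soup.p)

for piece in output:
    link = piece.a
    print(link["href"] if link is not None else None, repr(piece.get_text()))
```

Each entry in output is its own small tree, so output[1].a, output[1].text and the sibling links inside a piece work as usual. Whether this is preferable to iterating over the links in place depends on how much of each piece needs to be processed.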

Answered 2025-04-17 by Python大师
