How do I get BeautifulSoup 4 to respect a self-closing tag?


Asked 2025-04-17 16:22 by 数据工程师

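This question is specific to BeautifulSoup4, which makes it different from these earlier questions:

Why is BeautifulSoup modifying my self-closing elements?

Self-closing tags in BeautifulSoup

BeautifulStoneSoup (the previous XML parser) no longer exists, so how can I get bs4 to respect a new self-closing tag? For example: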

import bs4
S = '''<foo> <bar a="3"/> </foo>'''
soup = bs4.BeautifulSoup(S, selfClosingTags=['bar'])

print(soup.prettify())

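In this example the bar tag is not kept self-closing, but bs4 prints a warning that hints at the cause. What is the tree builder that the warning refers to, and how do I make the tag self-close?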

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
  "BS4 does not respect the selfClosingTags argument to the "
<html>
 <body>
  <foo>
   <bar a="3">
   </bar>
  </foo>
 </body>
</html>


1 Answer

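To parse XML, pass "xml" as the second argument to the BeautifulSoup constructor. This selects the XML tree builder, which is the component that decides which tags are self-closing.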

soup = bs4.BeautifulSoup(S, 'xml')

The "xml" parser is provided by lxml, so that library must be installed first.
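If it is unclear whether lxml is available on a given machine, a small check before building the soup makes the failure explicit. The following is a minimal sketch and not part of the original answer: the helper name make_xml_soup is hypothetical, and it assumes Python 3 and that bs4 raises bs4.FeatureNotFound when no installed parser provides the requested feature.

```python
import importlib.util

import bs4


def make_xml_soup(markup):
    """Parse markup with the "xml" tree builder (hypothetical helper)."""
    # lxml provides the "xml" feature; check for it before asking bs4.
    if importlib.util.find_spec("lxml") is None:
        raise RuntimeError('lxml is not installed; try "pip install lxml"')
    try:
        return bs4.BeautifulSoup(markup, "xml")
    except bs4.FeatureNotFound as exc:
        # bs4 could not find any parser that provides the "xml" feature.
        raise RuntimeError("bs4 could not find an XML parser") from exc
```

Calling make_xml_soup(S) with the string from the question should then return the same soup as the one-line constructor call above, or fail with a readable message.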

With the XML parser you no longer need to pass selfClosingTags at all:

In [1]: import bs4
In [2]: S = '''<foo> <bar a="3"/> </foo>'''
In [3]: soup = bs4.BeautifulSoup(S, 'xml')
In [4]: print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<foo>
 <bar a="3"/>
</foo>
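To see that the tree builder, not a constructor argument, decides this, the same string can be parsed with two different builders and the result compared. This is a minimal sketch assuming Python 3 with bs4 and lxml installed; the comparison against "html.parser" is an addition for illustration and does not come from the original answer. It uses the is_empty_element attribute of bs4 tags.

```python
import bs4

S = '<foo> <bar a="3"/> </foo>'

# "xml" selects the lxml-backed XML tree builder, which treats any
# tag without contents as an empty-element (self-closing) tag.
xml_bar = bs4.BeautifulSoup(S, "xml").find("bar")

# "html.parser" selects an HTML tree builder, which only self-closes
# the void elements defined by HTML (br, img, ...), so <bar> is expanded.
html_bar = bs4.BeautifulSoup(S, "html.parser").find("bar")

print(xml_bar.is_empty_element)   # True
print(html_bar.is_empty_element)  # False
print(str(xml_bar))               # <bar a="3"/>
print(str(html_bar))              # <bar a="3"></bar>
```

The second result matches the expanded <bar a="3"></bar> shown in the question, which was produced by an HTML tree builder.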

Answered 2025-04-17 by Python大师
