Listing the distinct XML element names with BeautifulSoup

2 votes

1 answer

2858 views

Asked 2025-04-20 05:26

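I'm using BeautifulSoup to parse an XML document. Is there a simple way to get a list of the distinct element names used in the document?

For example, if the document is: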

<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

I'd like to get: note, to, from, heading, body

Tags: data extraction, beautifulsoup, XML parsing, document processing, element names

1 Answer

You can call find_all() with no arguments and read the .name of each tag it returns:

from bs4 import BeautifulSoup

data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
"""

soup = BeautifulSoup(data, 'xml')
print([tag.name for tag in soup.find_all()])

Output:

['note', 'to', 'from', 'heading', 'body']

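Note that for this to work you need the lxml module installed. According to the documentation:

Right now, the only supported XML parser is lxml. If you don't have lxml installed, asking for an XML parser won't give you one, and asking for "lxml" won't work either.

Next, why not just use a dedicated XML parser?

For example, using lxml: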
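Since the 'xml' feature depends on lxml, it can help to check for it before parsing. The following is only a minimal sketch; the helper name lxml_available is made up for illustration and is not part of BeautifulSoup or lxml:

```python
import importlib.util


def lxml_available():
    """Return True if the lxml package can be imported in this environment.

    Hypothetical helper for illustration; it only looks the package up
    and does not import it.
    """
    return importlib.util.find_spec("lxml") is not None


if not lxml_available():
    # Without lxml, BeautifulSoup has no XML parser to hand the document to.
    print("lxml is not installed; BeautifulSoup(data, 'xml') will not work")
```

If the check fails, installing lxml (for example with pip) or using one of the parsers shown further down are the obvious options.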

from lxml import etree

data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
"""

# Pass bytes: lxml may reject a str that carries an encoding declaration
tree = etree.fromstring(data.encode('utf-8'))
print([item.tag for item in tree.xpath('//*')])

Output:

['note', 'to', 'from', 'heading', 'body']

And then, why use a third-party tool at all for such a simple task?

For example, using xml.etree.ElementTree from the standard library:

from xml.etree.ElementTree import fromstring, ElementTree

data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to> Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
"""

tree = ElementTree(fromstring(data))
# getiterator() was removed in Python 3.9; iter() is its replacement
print([item.tag for item in tree.iter()])

Output:

['note', 'to', 'from', 'heading', 'body']
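All three snippets return one entry per element, so a document that repeats an element would list its name more than once. If you want each name only once, one possible approach is to pass the names through dict.fromkeys, which drops repeats while keeping the order of first appearance. Here is a sketch using the standard library; the sample document with a repeated to element and the function name distinct_tags are made up for illustration:

```python
from xml.etree.ElementTree import fromstring

# Hypothetical sample with a repeated <to> element (not the document from the question).
data = """<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to>Tove</to>
    <to>Ole</to>
    <from>Jani</from>
</note>
"""


def distinct_tags(xml_text):
    """Return each element name once, in order of first appearance."""
    root = fromstring(xml_text)
    # dict.fromkeys keeps insertion order (Python 3.7+) and discards repeats.
    return list(dict.fromkeys(element.tag for element in root.iter()))


print(distinct_tags(data))  # ['note', 'to', 'from']
```

The same dict.fromkeys step should work on the lists produced by the BeautifulSoup and lxml versions. If order does not matter, set() is a shorter alternative.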

Answered 2025-04-20 by Python大师
