Filtering XML with Python

7 votes

1 answer

16355 views

Asked 2025-04-15 22:54

I have an XML document like this:

<node0>
    <node1>
      <node2 a1="x1"> ... </node2>
      <node2 a1="x2"> ... </node2>
      <node2 a1="x1"> ... </node2>
    </node1>
</node0>

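I want to filter out node2 when a1="x2". The user supplies the XPath and the attribute value to test and filter on. I looked at some Python solutions such as BeautifulSoup, but they are too complicated and do not preserve the case of the text. I want the rest of the document to stay unchanged while some content is filtered out.

Can you recommend a simple, straightforward solution? This does not seem like it should be too complicated. The real XML document is not this simple, but the idea is the same.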


1 Answer

This uses xml.etree.ElementTree, which is part of the Python standard library:

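import xml.etree.ElementTree as xee

data = '''\
<node1>
  <node2 a1="x1"> ... </node2>
  <node2 a1="x2"> ... </node2>
  <node2 a1="x1"> ... </node2>
</node1>
'''
doc = xee.fromstring(data)

# findall() returns a list, so removing elements inside the loop is safe
for tag in doc.findall('node2'):
    # .get() avoids a KeyError when a node2 has no a1 attribute
    if tag.get('a1') == 'x2':
        doc.remove(tag)
# encoding='unicode' returns a str instead of bytes on Python 3
print(xee.tostring(doc, encoding='unicode'))
# <node1>
#   <node2 a1="x1"> ... </node2>
#   <node2 a1="x1"> ... </node2>
# </node1>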
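
Since the question says the tag and attribute value come from the user, the loop above could be wrapped in a small helper. This is only a sketch: the name `remove_by_attribute` and its parameters are made up for illustration, and it assumes the matching elements are direct children of the root, as in the sample data.

```python
import xml.etree.ElementTree as ET


def remove_by_attribute(xml_text, tag, attr, value):
    """Return xml_text with direct children <tag attr="value"> removed.

    Hypothetical helper for illustration, not part of ElementTree.
    """
    root = ET.fromstring(xml_text)
    # findall() returns a list, so removing while looping is safe.
    for child in root.findall(tag):
        # .get() returns None instead of raising if the attribute is missing.
        if child.get(attr) == value:
            root.remove(child)
    # encoding="unicode" yields a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


data = '''\
<node1>
  <node2 a1="x1"> ... </node2>
  <node2 a1="x2"> ... </node2>
  <node2 a1="x1"> ... </node2>
</node1>
'''
filtered = remove_by_attribute(data, "node2", "a1", "x2")
print(filtered)
```

Tag names, attribute names, and text keep their original case, because ElementTree parses and serializes XML case-sensitively.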

This uses lxml, which does not ship with Python but offers a more powerful syntax:

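import lxml.etree

data = '''\
<node1>
  <node2 a1="x1"> ... </node2>
  <node2 a1="x2"> ... </node2>
  <node2 a1="x1"> ... </node2>
</node1>
'''
doc = lxml.etree.XML(data)
# The predicate attaches directly to the tag name; findall() catches every match
for e in doc.findall('node2[@a1="x2"]'):
    doc.remove(e)
# encoding='unicode' returns a str instead of bytes
print(lxml.etree.tostring(doc, encoding='unicode'))

# <node1>
#   <node2 a1="x1"> ... </node2>
#   <node2 a1="x1"> ... </node2>
# </node1>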

Addendum: if node2 is nested deeper in the XML, you can iterate over every element, treat each one as a potential parent, and remove any matching node2 children it has:

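Using only xml.etree.ElementTree:

doc = xee.fromstring(data)
# getiterator() was removed in Python 3.9; iter() is the replacement.
# list() takes a snapshot so the tree can be modified while looping.
for parent in list(doc.iter()):
    for child in parent.findall('node2'):
        if child.get('a1') == 'x2':
            parent.remove(child)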
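
ElementTree elements do not keep a reference to their parent, so another common way to handle deep nesting is to build a child-to-parent map first and remove each match through it. The sketch below assumes Python 3 and uses a hypothetical helper name, `remove_deep`; it also assumes the attribute value contains no single quote, since the value is pasted into the path predicate. The sample document is the one from the question.

```python
import xml.etree.ElementTree as ET


def remove_deep(root, tag, attr, value):
    """Remove every <tag attr="value"> descendant of root, in place.

    Hypothetical helper for illustration; returns the number of removed elements.
    """
    # Map each element to its parent, since ElementTree has no getparent().
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # './/' searches all descendants; the predicate filters by attribute value.
    matches = root.findall(f".//{tag}[@{attr}='{value}']")
    for element in matches:
        parent_of[element].remove(element)
    return len(matches)


data = '''\
<node0>
    <node1>
      <node2 a1="x1"> ... </node2>
      <node2 a1="x2"> ... </node2>
      <node2 a1="x1"> ... </node2>
    </node1>
</node0>
'''
root = ET.fromstring(data)
removed = remove_deep(root, "node2", "a1", "x2")
result = ET.tostring(root, encoding="unicode")
print(removed)
print(result)
```

With lxml the map is unnecessary, because each element exposes its parent through `getparent()`.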

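Using lxml:

doc = lxml.etree.XML(data)
# list() takes a snapshot so the tree can be modified while looping
for parent in list(doc.iter('*')):
    # findall() removes every matching child, not just the first one
    for child in parent.findall('node2[@a1="x2"]'):
        parent.remove(child)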

Answered 2025-04-15 by Python大师
