Testing equivalence of xml.etree.ElementTree

36 votes

6 answers

24998 views

Asked 2025-04-17 05:03

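I am interested in whether two XML elements are equal. I found that comparing their string representations works, but it seems a bit hacky.

Is there a better way to test whether two etree elements are equal?

Comparing the elements directly: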

import xml.etree.ElementTree as etree
h1 = etree.Element('hat',{'color':'red'})
h2 = etree.Element('hat',{'color':'red'})

h1 == h2  # False

Comparing the elements as strings:

etree.tostring(h1) == etree.tostring(h2)  # True

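6 Answers

Serializing and comparing the strings does not work well for XML, because attribute order is not significant in XML (among other reasons). For example, the following two elements are logically identical, but their strings differ: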

<THING a="foo" b="bar"></THING>
<THING b="bar" a="foo"  />
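
To make the point concrete, here is a minimal sketch that parses the two elements above with the standard library. It assumes Python 3.8 or later, where `ET.tostring` keeps each element's own attribute order; older versions sorted attributes on output, so the string comparison may come out differently there. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<THING a="foo" b="bar"></THING>')
b = ET.fromstring('<THING b="bar" a="foo"  />')

# attrib is a plain dict, and dict equality ignores insertion order
same_tag = a.tag == b.tag
same_attributes = a.attrib == b.attrib

# The serialized forms depend on how the serializer orders attributes,
# so this result can vary between Python versions.
same_string = ET.tostring(a) == ET.tostring(b)

print(same_tag, same_attributes, same_string)
```

The tag and attribute comparisons agree regardless of how the attributes were written, which is why a structural comparison is more reliable than a string comparison.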

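Exactly how to compare two elements is actually rather complicated. As far as I know, ElementTree has nothing built in that does this for you. I wrote my own code for it. The code below works for me, but it is not suited to large XML structures and is neither fast nor efficient. The function is an ordering function rather than an equality test, so a result of 0 means the elements are equal and any other result means they are not. Wrapping it in a function that returns True or False is left as an exercise for the reader.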

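from functools import cmp_to_key

def cmp_el(a, b):
    """Three-way comparison: returns -1, 0 or 1.

    Compares tag, tail, attributes and children (element text is not compared).
    """
    # compare tag, then tail (a missing tail is treated as an empty string)
    atail, btail = a.tail or '', b.tail or ''
    if a.tag < b.tag:
        return -1
    elif a.tag > b.tag:
        return 1
    elif atail < btail:
        return -1
    elif atail > btail:
        return 1

    # compare attributes, ignoring their order
    aitems = sorted(a.attrib.items())
    bitems = sorted(b.attrib.items())
    if aitems < bitems:
        return -1
    elif aitems > bitems:
        return 1

    # compare child nodes, ignoring their order
    achildren = sorted(a, key=cmp_to_key(cmp_el))
    bchildren = sorted(b, key=cmp_to_key(cmp_el))
    if len(achildren) != len(bchildren):
        # zip() below would silently drop the extra children
        return -1 if len(achildren) < len(bchildren) else 1

    for achild, bchild in zip(achildren, bchildren):
        cmpval = cmp_el(achild, bchild)
        if cmpval != 0:
            return cmpval

    # must be equal
    return 0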
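
As one possible take on the exercise of getting a True/False result, the sketch below uses a different technique from the ordering function above: it builds a nested, sortable tuple for each element and compares the tuples. The names `canonical_key` and `unordered_equal` are hypothetical, and treating a missing `text` or `tail` as an empty string is an assumption. Unlike `cmp_el`, this version also compares element text.

```python
import xml.etree.ElementTree as ET

def canonical_key(el):
    """Hypothetical helper: describe el as a nested tuple that sorts cleanly."""
    return (
        el.tag,
        el.text or '',                      # assumption: None and '' are equivalent
        el.tail or '',
        tuple(sorted(el.attrib.items())),   # attribute order is ignored
        tuple(sorted(canonical_key(child) for child in el)),  # child order is ignored
    )

def unordered_equal(a, b):
    """Return True if a and b match, ignoring attribute and child order."""
    return canonical_key(a) == canonical_key(b)

first = ET.fromstring('<r><x a="1" b="2"/><y/></r>')
second = ET.fromstring('<r><y/><x b="2" a="1"/></r>')
third = ET.fromstring('<r><x a="1"/></r>')

print(unordered_equal(first, second))
print(unordered_equal(first, third))
```

Ignoring child order is a deliberate choice here, mirroring the sorting done in `cmp_el`. If sibling order matters in your documents, drop the `sorted` call around the children.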

Answered 2025-04-17 by Python大师

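Comparing strings does not always work. When deciding whether two nodes are equal, the order of attributes should not affect the result. With a plain string comparison, however, the order obviously matters.

I am not sure whether this is a problem or a feature, but my version of lxml.etree preserves the order of attributes when parsing from a file or a string: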

>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False

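This may be version-dependent (I am using Python 2.7.3 and lxml.etree 2.3.2 on Ubuntu). I remember that about a year ago, when I wanted to control the attribute order for readability, I could not find a way to do it.

Since I need to compare XML files produced by different serializers, I see no way other than recursively comparing each node's tag, text, attributes, and children. The tail should be included too, of course, if it contains anything interesting.

lxml vs. xml.etree.ElementTree

In fact, this may depend on the implementation. Apparently lxml uses something like an ordered dict, while the standard xml.etree.ElementTree (at least in the Python 2.7 session shown here) does not preserve the order of attributes. Since Python 3.8, xml.etree.ElementTree also preserves attribute order when serializing.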

Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
>>> etree.tostring(h1)
'<hat color="blue" price="39.90"/>'
>>> etree.tostring(h2)
'<hat price="39.90" color="blue"/>'
>>> etree.dump(h1)
<hat color="blue" price="39.90"/>>>> etree.dump(h2)
<hat price="39.90" color="blue"/>>>>

(Yes, the newlines are missing, but that is a minor problem.)

>>> import xml.etree.ElementTree as ET
>>> h1 = ET.XML('<hat color="blue" price="39.90"/>')
>>> h1
<Element 'hat' at 0x2858978>
>>> h2 = ET.XML('<hat price="39.90" color="blue"/>')
>>> ET.dump(h1)
<hat color="blue" price="39.90" />
>>> ET.dump(h2)
<hat color="blue" price="39.90" />
>>> ET.tostring(h1) == ET.tostring(h2)
True
>>> ET.dump(h1) == ET.dump(h2)
<hat color="blue" price="39.90" />
<hat color="blue" price="39.90" />
True

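Another question is what should be considered unimportant when comparing. For example, some fragments may contain extra whitespace that we do not care about. In that case, it is better to write a serialization function that does exactly what we need.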
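
One way to approximate such a serialization function, assuming Python 3.8 or later, is the standard library's `xml.etree.ElementTree.canonicalize`, which writes C14N 2.0 output (attributes in a fixed order, empty elements in one form) and can strip surrounding whitespace from text via `strip_text=True`. The wrapper name `canonical_form` is hypothetical, and whether stripped whitespace counts as unimportant depends on your data.

```python
import xml.etree.ElementTree as ET

def canonical_form(xml_text):
    """Hypothetical wrapper: normalize an XML string so it can be compared.

    strip_text=True drops whitespace around text content, which is an
    assumption about what counts as insignificant.
    """
    return ET.canonicalize(xml_text, strip_text=True)

h1 = '<hat color="blue" price="39.90"/>'
h2 = '<hat price="39.90" color="blue"></hat>'

print(canonical_form(h1) == canonical_form(h2))
```

To compare already-parsed elements this way, they can first be turned back into text with `ET.tostring(element, encoding="unicode")`.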

Answered 2025-04-17 by Python大师

This comparison function works well for me:

def elements_equal(e1, e2):
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))

Answered 2025-04-17 by Python大师
