Removing elements from an XML document tree based on conditions

Asked 2025-04-16 06:56

My task is to do some small-scale refactoring of XML tree elements in Python 3, namely to replace the following structure:

<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  <sup>
   <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
  </sup>
 </a>
</span>

with:

<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
 </a>
</span>

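In other words, the sup element should be removed only if the whole structure matches the first example exactly. I need to keep the XML document intact while doing this, so matching with regular expressions is not an option.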

I already have code that does what I need:

doc = self.__refactor_links(doc)
...
def __refactor_links(self, node):
    """Recursively seeks for links to refactor them"""
    for span in node.childNodes:
        replace = False
        if isinstance(span, xml.dom.minidom.Element):
            if span.tagName == "span" and span.getAttribute("class") == "nobr":
                if span.childNodes.length == 1:
                    a = span.childNodes.item(0)
                    if isinstance(a, xml.dom.minidom.Element):
                        if a.tagName == "a" and a.getAttribute("href"):
                            if a.childNodes.length == 2:
                                aurl = a.childNodes.item(0)
                                if isinstance(aurl, xml.dom.minidom.Text):
                                    sup = a.childNodes.item(1)
                                    if isinstance(sup, xml.dom.minidom.Element):
                                        if sup.tagName == "sup":
                                            if sup.childNodes.length == 1:
                                                img = sup.childNodes.item(0)
                                                if isinstance(img, xml.dom.minidom.Element):
                                                    if img.tagName == "img" and img.getAttribute("class") == "rendericon":
                                                        replace = True
            else:
                self.__refactor_links(span)
        if replace:
            a.removeChild(sup)
    return node

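This code does not recurse through every tag: once it finds a structure that resembles the target, it does not keep searching inside those elements, even if the match ultimately fails. In my case I don't need that (it would be nice to have, but adding a lot of else: self.__refactor_links(tag) branches doesn't feel right).

If any condition fails, nothing should be removed. Is there a cleaner way to define a set of conditions that avoids a long chain of 'if' statements? Some custom data structure could hold the conditions, e.g. ('sup', ('img', (...))), but I don't know how to process it. If you have any suggestions or Python examples, please help.

Thanks.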

4 Answers

Here is a quick trick with lxml. I strongly recommend using XPath.

>>> from lxml import etree
>>> doc = etree.XML("""<span class="nobr">
...  <a href="http://www.google.com/">
...   http://www.google.com/
...   <sup>
...    <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
...   </sup>
...  </a>
... </span>""")
>>> for a in doc.xpath('//span[@class="nobr"]/a[@href="http://www.google.com/"]'):
...     for sub in list(a):
...         a.remove(sub)
...
>>> print(etree.tostring(doc, pretty_print=True, encoding="unicode"))
<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  </a>
</span>

Answered 2025-04-16 by Python大师

This is definitely a job for an XPath expression, probably combined with the lxml library.

The XPath expression would look something like this:

//span[@class="nobr"]/a[@href]/sup[img/@class="rendericon"]

Match this XPath expression against your tree and delete every element it matches. That way you don't need a pile of nested if statements or any recursion.
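
A minimal sketch of that idea, written against the standard library so it runs without installing lxml. This is a stand-in rather than the exact approach above: xml.etree.ElementTree supports only a subset of XPath and cannot evaluate the img/@class predicate, so that check is done in Python. With lxml the full expression could presumably be passed to doc.xpath(...) directly. The function name remove_icon_sups is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  <sup>
   <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
  </sup>
 </a>
</span>"""


def remove_icon_sups(root):
    """Remove <sup> children of span.nobr > a[href] that contain a rendericon <img>.

    Returns the number of <sup> elements removed.
    """
    removed = 0
    # iter() also visits root itself, which a './/span' path would skip.
    # list() takes a snapshot so the tree can be modified while looping.
    for span in list(root.iter("span")):
        if span.get("class") != "nobr":
            continue
        for a in span.findall("a[@href]"):
            for sup in a.findall("sup"):
                # Stands in for the [img/@class="rendericon"] predicate.
                if sup.find("img[@class='rendericon']") is not None:
                    a.remove(sup)
                    removed += 1
    return removed


root = ET.fromstring(XML)
count = remove_icon_sups(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

If the match has to be as strict as in the question (exactly one child at each level), checks such as len(span) == 1 and len(sup) == 1 could be added before the removal.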

Answered 2025-04-16 by Python大师

I'm not very familiar with XML, but can't you just use the find/search functions on the nodes?

>>> from xml.dom.minidom import parse, parseString
>>> dom = parseString(x)  # x: the XML snippet from the question, as a string
>>> k = dom.getElementsByTagName('sup')
>>> for l in k:
...     p = l.parentNode
...     p.removeChild(l)
... 
<DOM Element: sup at 0x100587d40>
>>> 
>>> print(dom.toxml())
<?xml version="1.0" ?><span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/

 </a>
</span>
>>>
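
Note that the loop above drops every sup in the document. A minimal sketch that keeps the same getElementsByTagName search but only removes a sup sitting in the structure from the question, assuming whitespace-only text nodes can be ignored. The helper names element_children and is_icon_sup are hypothetical.

```python
from xml.dom.minidom import parseString

XML = """<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  <sup>
   <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
  </sup>
 </a>
</span>"""


def element_children(node):
    """Return only the element children, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def is_icon_sup(sup):
    """True if sup sits in span.nobr > a[href] and wraps exactly one rendericon img."""
    a = sup.parentNode
    if a.nodeType != a.ELEMENT_NODE or a.tagName != "a" or not a.getAttribute("href"):
        return False
    span = a.parentNode
    if span.nodeType != span.ELEMENT_NODE or span.tagName != "span":
        return False
    if span.getAttribute("class") != "nobr":
        return False
    kids = element_children(sup)
    return (len(kids) == 1 and kids[0].tagName == "img"
            and kids[0].getAttribute("class") == "rendericon")


dom = parseString(XML)
# Walk up from each candidate instead of nesting ifs from the top down.
for sup in dom.getElementsByTagName("sup"):
    if is_icon_sup(sup):
        sup.parentNode.removeChild(sup)
result = dom.toxml()
print(result)
```

Checking the parents from the sup upwards keeps each condition on its own line, so a failed condition simply returns False and nothing is removed.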

Answered 2025-04-16 by Python大师
