Parsing XML with Python and minidom

18 votes

5 answers

98460 views

Asked 2025-04-15 15:13

I'm using Python's minidom library to parse an XML file that has a hierarchical structure like this (indentation is used here to show the hierarchy):

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

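However, the program keeps iterating over the nodes and ends up printing duplicates. (Looking at the node list on each iteration makes it obvious why this happens, but I can't find a way to get the node list I actually want.)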

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom

dom = xml.dom.minidom.parse("test.xml")
topics = dom.getElementsByTagName('Topic')
for node in topics:
    titles = node.getElementsByTagName('Title')
    for title in titles:
        print(title.firstChild.data)

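I can fix this by not nesting Topic elements, renaming the lower-level topics to something like SubTopic1 and SubTopic2. But I want to take advantage of XML's built-in hierarchical structure without needing different element names; it seems I should be able to nest Topic elements, and there should be some way to know which level of Topic I'm currently looking at.

I've tried a number of different XPath functions without much success.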


5 Answers

I think this might help:

import xml.dom.minidom

# Parse the file by name.
doc = xml.dom.minidom.parse("file.xml")

# Each Topic has exactly one Title, and Title elements come back in
# document order, so one flat pass prints every title exactly once.
for title in doc.getElementsByTagName('Title'):
    print(title.firstChild.nodeValue)

Output:

My Document
Overview
Basic Features
About This Software
Platforms Supported

Answered 2025-04-15 by Python大师


The following code works:

import xml.dom.minidom

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    # Yield only the direct children of node that are Title elements.
    for child in node.childNodes:
        if child.localName == 'Title':
            yield child

topics = dom.getElementsByTagName('Topic')
for node in topics:
    for title in getChildrenByTitle(node):
        print(title.childNodes[0].nodeValue)

Answered 2025-04-15 by Python大师


getElementsByTagName is recursive: it finds every matching descendant, not just direct children. Because your Topics contain other Topics, which also have Titles, the call finds the deeper Titles multiple times.
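
The effect is easy to see by counting how many Title elements each Topic reports. This is a minimal sketch, assuming the question's XML is inlined as a string (with the XML declaration omitted); the function name `descendant_title_counts` is hypothetical, not part of minidom:

```python
import xml.dom.minidom

# Inline copy of the question's XML (declaration omitted) so the sketch is self-contained.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def descendant_title_counts(dom):
    """Return (own title, number of Title descendants) for every Topic."""
    counts = []
    for topic in dom.getElementsByTagName("Topic"):
        # The Topic's own title is its direct Title child element.
        own_title = next(
            child for child in topic.childNodes
            if child.nodeType == child.ELEMENT_NODE and child.tagName == "Title"
        )
        # getElementsByTagName searches the whole subtree below this Topic.
        found = topic.getElementsByTagName("Title")
        counts.append((own_title.firstChild.data, len(found)))
    return counts


counts = descendant_title_counts(xml.dom.minidom.parseString(XML))
for title, n in counts:
    print(title, n)
```

The per-Topic counts add up to 9, which matches the nine lines printed by the program in the question.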

If you only want direct children rather than all descendants, and you don't have XPath available, you can write a simple filter, for example:

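import xml.dom.minidom

def getChildrenByTagName(node, tagName):
    # Yield direct child elements of node whose tag matches ('*' matches any).
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and (tagName == '*' or child.tagName == tagName):
            yield child

document = xml.dom.minidom.parse("test.xml")  # file name taken from the question
for topic in document.getElementsByTagName('Topic'):
    title = next(getChildrenByTagName(topic, 'Title'))  # first direct Title child
    print(title.firstChild.data)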
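
To also recover which level each Topic sits at, one option is to recurse through direct children only and carry a depth counter along. This is a minimal sketch, assuming the question's XML is inlined as a string (declaration omitted); the helper names `child_elements` and `walk_topics` are hypothetical, not part of minidom:

```python
import xml.dom.minidom

# Inline copy of the question's XML (declaration omitted) so the sketch is self-contained.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def child_elements(node, tag_name):
    """Yield only the direct child elements of node named tag_name."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            yield child


def walk_topics(node, depth=0):
    """Yield (depth, title) for each Topic, descending one level at a time."""
    for topic in child_elements(node, "Topic"):
        title = next(child_elements(topic, "Title"), None)
        # Tolerate a missing or empty Title rather than raising.
        text = title.firstChild.data if title is not None and title.firstChild else ""
        yield depth, text
        # Nested Topics are visited with an incremented depth.
        yield from walk_topics(topic, depth + 1)


dom = xml.dom.minidom.parseString(XML)
outline = list(walk_topics(dom.documentElement))
for depth, title in outline:
    print("    " * depth + title)
```

With the sample document this prints each title once, indented the same way as the outline at the top of the question. Because the recursion only ever looks at direct children, no Topic is visited twice.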

Answered 2025-04-15 by Python大师

