How to stop Python's XML parser from returning extra child nodes
I have a simple XML file that I want to read with Python's DOM (see below):
XML file:
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
	<Header>
		<Reserved>2</Reserved>
		<CPU>1</CPU>
		<Flag>1</Flag>
		<VQI>12</VQI>
		<Group_ID>16</Group_ID>
		<DI>2</DI>
		<DE>1</DE>
		<ACOSS>5</ACOSS>
		<RGH>8</RGH>
	</Header>
</HeaderLookup>
Python code:
from xml.dom import minidom
xml_file = open("test.xml")
xmlroot = minidom.parse(xml_file).documentElement
xml_file.close()
for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)
Result:
<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">
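I expected 9 child nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS and RGH), but for some reason 19 nodes come back, 10 of which are whitespace nodes (why does that even count as a node?!). Can anyone tell me how to make the XML parser leave out the whitespace nodes?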
1 Answer
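In XML, whitespace is significant, so the DOM reports the line breaks and indentation between elements as text nodes. You could look at ElementTree instead, which handles XML differently from the DOM.
Example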
from xml.etree import ElementTree as et
data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
<Header>
<Reserved>2</Reserved>
<CPU>1</CPU>
<Flag>1</Flag>
<VQI>12</VQI>
<Group_ID>16</Group_ID>
<DI>2</DI>
<DE>1</DE>
<ACOSS>5</ACOSS>
<RGH>8</RGH>
</Header>
</HeaderLookup>
'''
tree = et.fromstring(data)
for n in tree.find('Header'):
    print('%s = %s' % (n.tag, n.text))
Output
Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
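If you would rather stay with minidom, one common workaround is to skip every child that is not an element node. The following is only a minimal sketch and not part of the original answer: the helper name element_children is made up for illustration, and the XML is inlined with parseString (without the XML declaration) instead of being read from test.xml.

```python
from xml.dom import minidom

# The question's document, inlined so the sketch is self-contained.
DATA = """<HeaderLookup>
  <Header>
    <Reserved>2</Reserved>
    <CPU>1</CPU>
    <Flag>1</Flag>
    <VQI>12</VQI>
    <Group_ID>16</Group_ID>
    <DI>2</DI>
    <DE>1</DE>
    <ACOSS>5</ACOSS>
    <RGH>8</RGH>
  </Header>
</HeaderLookup>
"""


def element_children(node):
    """Hypothetical helper: keep only element children, dropping text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


root = minidom.parseString(DATA).documentElement
header = root.getElementsByTagName("Header")[0]
names = [child.tagName for child in element_children(header)]

print(len(header.childNodes))  # 19: whitespace text nodes are included
print(len(names))              # 9: elements only
```

The whitespace nodes are still in the tree; the filter simply ignores them while iterating.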
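Example (extending the ElementTree code above)
The whitespace is still there, but it is stored in the .tail attribute. tail is the text node that follows an element (between the end of that element and the start of the next one), while text is the text node between an element's start tag and its first child or end tag.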
def dump(e):
    # Recursively print each element's tag, text and tail.
    print('<%s>' % e.tag)
    print('text = %r' % (e.text,))
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail = %r' % (e.tail,))
dump(tree)
Output
<HeaderLookup>
text = '\n '
<Header>
text = '\n '
<Reserved>
text = '2'
</Reserved>
tail = '\n '
<CPU>
text = '1'
</CPU>
tail = '\n '
<Flag>
text = '1'
</Flag>
tail = '\n '
<VQI>
text = '12'
</VQI>
tail = '\n '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n '
<DI>
text = '2'
</DI>
tail = '\n '
<DE>
text = '1'
</DE>
tail = '\n '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n '
<RGH>
text = '8'
</RGH>
tail = '\n '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
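To see text and tail on something smaller, here is a minimal sketch that is not from the original answer. It uses two made-up one-line documents, and is_blank is a hypothetical helper showing one way to treat whitespace-only values as empty.

```python
from xml.etree import ElementTree as et

# Made-up mixed-content snippet: "x" comes before <b>, "z" comes after it.
a = et.fromstring("<a>x<b>y</b>z</a>")
b = a.find("b")

print(repr(a.text))  # 'x'  (between <a> and <b>)
print(repr(b.text))  # 'y'  (inside <b>)
print(repr(b.tail))  # 'z'  (between </b> and </a>)
print(repr(a.tail))  # None (nothing follows the root)


def is_blank(s):
    """Hypothetical helper: True for None or whitespace-only strings."""
    return s is None or not s.strip()


# In an indented document the tail holds only the line break and indentation.
c = et.fromstring("<r>\n  <c>1</c>\n</r>").find("c")
print(is_blank(c.tail))  # True
print(is_blank(c.text))  # False
```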