Python xml.dom.minidom generates invalid XML?
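I have run into a strange problem with Python's xml.dom.minidom package. I generate a document and fill it with data obtained from a terminal. Sometimes that data contains terminal control characters. When I store such characters in a text node and serialize the document with toprettyxml(), everything appears to work, but the resulting document is not valid XML.
Does anyone know why minidom allows an invalid document to be generated? Does it have something to do with the "mini" in its name?
Here is a reduced example (along with some system information):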
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.dom import minidom
>>> impl = minidom.getDOMImplementation()
>>> doc = impl.createDocument(None, "results", None)
>>> root = doc.firstChild
>>> outString = "test "+chr(1) #here goes control character
>>> root.appendChild(doc.createTextNode(outString))
<DOM Text node "'test \x01'">
>>> doc.toprettyxml(encoding="utf-8")
'<?xml version="1.0" encoding="utf-8"?>\n<results>\n\ttest \x01\n</results>\n'
>>> with open("/tmp/outfile", "w") as f:
...     f.write(doc.toprettyxml(encoding="utf-8"))
...
>>> doc2 = minidom.parse("/tmp/outfile")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3, column 6
>>> open("/tmp/outfile","r").readlines()
['<?xml version="1.0" encoding="utf-8"?>\n', '<results>\n', '\ttest \x01\n', '</results>\n']
>>>
1 Answer
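Take a look at the _write_data code. It only escapes ampersands, angle brackets, and double quotes, and passes every other character through unchanged: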
def _write_data(writer, data):
    "Writes datachars to writer."
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace("\"", "&quot;").replace(">", "&gt;")
    writer.write(data)
As you guessed, minidom is not a very robust implementation (its namespace handling, for example, is incomplete).
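Since minidom does not check text for characters that XML 1.0 forbids, one possible workaround is to filter the data yourself before it goes into a text node. The following is only a minimal sketch, written for Python 3 rather than the 2.6 session in the question. The helper names strip_invalid_xml_chars and build_results_document are made up for illustration and are not part of minidom. The character class follows the Char production of the XML 1.0 specification (tab, LF, and CR are the only allowed characters below U+0020).

```python
import re
from xml.dom import minidom

# Characters outside the XML 1.0 "Char" production:
# C0 controls other than tab/LF/CR, surrogates, U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop characters that XML 1.0 does not allow."""
    return _INVALID_XML_CHARS.sub("", text)


def build_results_document(text):
    """Hypothetical helper: build a <results> document holding cleaned text."""
    impl = minidom.getDOMImplementation()
    doc = impl.createDocument(None, "results", None)
    cleaned = strip_invalid_xml_chars(text)
    doc.documentElement.appendChild(doc.createTextNode(cleaned))
    return doc


# Same input as in the question: text followed by a control character.
doc = build_results_document("test " + chr(1))
xml_bytes = doc.toprettyxml(encoding="utf-8")

# The serialized output can now be parsed again without an ExpatError.
reparsed = minidom.parseString(xml_bytes)
```

Note that stripping throws the control characters away. If you need to keep them, you would have to encode the data in some other way (for example Base64) before storing it, because XML 1.0 cannot represent these characters even as character references.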