Python 3.3: Converting XML to YAML
I am trying to convert an XML file to YAML using Python 3.3. Here is my code:
#!/usr/bin/env python3
test_filename_input = './reference-conversions/wikipedia-example.xml'
test_filename_output = 'wikipedia-example_xml_read-as-binary.yaml'
file_object = open( test_filename_input, 'rb')
data_in = file_object.read()
file_object.close()
from xml.dom.minidom import parseString
document_object = parseString( data_in)
import yaml
stream = open( test_filename_output, 'w')
yaml.dump( document_object, stream)
stream.close()
For reference, I used this XML file:
<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <age>25</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumbers>
    <phoneNumber type="home">212 555-1234</phoneNumber>
    <phoneNumber type="fax">646 555-4567</phoneNumber>
  </phoneNumbers>
  <gender>
    <type>male</type>
  </gender>
</person>
After conversion, the result should look something like this:
---
firstName: John
lastName: Smith
age: 25
address:
  streetAddress: 21 2nd Street
  city: New York
  state: NY
  postalCode: 10021
phoneNumber:
  -
    type: home
    number: 212 555-1234
  -
    type: fax
    number: 646 555-4567
gender:
  type: male
However, what I actually get is:
&id001 !!python/object/new:xml.dom.minidom.Document
state: !!python/tuple
- implementation: !!python/object:xml.dom.minidom.DOMImplementation {}
- _elem_info: {}
  _id_cache: {}
  _id_search_stack: null
  childNodes: !!python/object/new:xml.dom.minicompat.NodeList
    listitems:
    - &id039 !!python/object/new:xml.dom.minidom.Element
      state: !!python/tuple
      - null
      - _attrs: null
        _attrsNS: null
        childNodes: !!python/object/new:xml.dom.minicompat.NodeList
          listitems:
          - &id045 !!python/object/new:xml.dom.minidom.Text
            state: !!python/tuple
            - null
            - _data: "\n "
              nextSibling: &id002 !!python/object/new:xml.dom.minidom.Element
                state: !!python/tuple
                - null
                - _attrs: null
                  _attrsNS: null
                  childNodes: !!python/object/new:xml.dom.minicompat.NodeList
                    listitems:
[...]
Does anyone know how to make PyYAML filter out the object internals of xml.dom.minidom, or is there an alternative to using xml.dom.minidom?
Thanks!
3 Answers
0
Use https://pypi.org/project/yaplon/ -> https://github.com/twardoch/yaplon/
xml22yaml -i "file.xml" -o "file.yaml"
However, it does not support UTF-8 files that start with a BOM.
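If the BOM is the only obstacle, one possible workaround is to write a copy of the file without it before running the converter. Below is a minimal sketch using `codecs.BOM_UTF8` from the standard library. The helper name `strip_utf8_bom` and the file names are hypothetical, and I have not checked how yaplon behaves on the result.

```python
import codecs


def strip_utf8_bom(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present."""
    with open(src_path, 'rb') as src:
        data = src.read()
    # The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    with open(dst_path, 'wb') as dst:
        dst.write(data)
    return dst_path
```

For example, `strip_utf8_bom('file.xml', 'file-nobom.xml')` would produce a copy that can then be passed to the converter with `-i "file-nobom.xml"`. Working on raw bytes avoids decoding and re-encoding the rest of the document.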
3
I found an XML-to-YAML tool, but I made one small change at around line 92:
outStr = yaml.dump(out)
became
outStr = yaml.safe_dump(out)
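This change removes any !!python/unicode tags from the output. I tested the script from the command line and it works fine; I believe only a simple adjustment is needed to make it work from the Python interactive interpreter.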
EDIT
I also added my own print method to make the output look more like what you originally posted:
import sys

def prettyPrint(node, level):
    childPrint = 0
    attrPrint = 0
    for x in node:
        try:
            # Print attributes first, as "- name: value" list items
            if x['attributes']:
                attrPrint = 1
                for l in range(0, level):
                    sys.stdout.write("\t")
                for a in x['attributes']:
                    sys.stdout.write("- %s: %s\n" % (a, x['attributes'][a]))
        except KeyError:
            try:
                # No attributes: recurse into the children, if any
                if x['children']:
                    childPrint = 1
                    for l in range(0, level):
                        sys.stdout.write("\t")
                    sys.stdout.write("%s:\n" % x['name'])
                    prettyPrint(x['children'], level+1)
            except KeyError:
                pass
        finally:
            if not childPrint:
                printNextNode(x, level, attrPrint)
                attrPrint = 0
            else:
                childPrint = 0

def printNextNode(node, level, attrPrint):
    for l in range(0, level):
        sys.stdout.write("\t")
    if attrPrint:
        sys.stdout.write(' ')
    sys.stdout.write("%s: %s\n" % (node['name'], node['text']))
This method is then called from the convertXml2Yaml function:
sys.stdout.write('%s:\n' % out['name'])
prettyPrint(out['children'], 1)
8
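Here is an approach that works around some of the problems with xml.dom and offers a way to handle the ambiguous case where a node has content together with attributes or child nodes. For the example input above, it outputs: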
$ python3 yamlout.py person.xml
---
person:
  firstName: John
  lastName: Smith
  age: 25
  address:
    streetAddress: 21 2nd Street
    city: New York
    state: NY
    postalCode: 10021
  phoneNumbers:
    phoneNumber:
      _xml_node_content: 212 555-1234
      type: home # Attribute
    phoneNumber:
      _xml_node_content: 646 555-4567
      type: fax # Attribute
  gender:
    type: male
The implementation, saved as yamlout.py:
import sys
import json
import xml.etree.ElementTree as ET

if len(sys.argv) != 2:
    sys.stderr.write("Usage: {0} <file>.xml\n".format(sys.argv[0]))
    sys.exit(1)

XML_NODE_CONTENT = '_xml_node_content'
ATTR_COMMENT = '# Attribute'

def yamlout(node, depth=0):
    if not depth:
        sys.stdout.write('---\n')
    # Nodes with both content AND nested nodes or attributes
    # have no valid yaml mapping. Add 'content' node for that case
    nodeattrs = node.attrib
    children = list(node)
    content = node.text.strip() if node.text else ''
    if content:
        if not (nodeattrs or children):
            # Write as just a name value, nothing else nested
            sys.stdout.write(
                '{indent}{tag}: {text}\n'.format(
                    indent=depth*'  ', tag=node.tag, text=content or ''))
            return
        else:
            # json.dumps for basic handling of multiline content
            nodeattrs[XML_NODE_CONTENT] = json.dumps(node.text)

    sys.stdout.write('{indent}{tag}:\n'.format(
        indent=depth*'  ', tag=node.tag))

    # Distinguish node attributes from nested nodes
    depth += 1
    for n,v in nodeattrs.items():
        sys.stdout.write(
            '{indent}{n}: {v} {c}\n'.format(
                indent=depth*'  ', n=n, v=v,
                c=ATTR_COMMENT if n!=XML_NODE_CONTENT else ''))
    # Write nested nodes
    for child in children:
        yamlout(child, depth)

with open(sys.argv[1]) as xmlf:
    tree = ET.parse(xmlf)
    yamlout(tree.getroot())
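A variation on the same idea is to build plain Python data first and leave the YAML formatting to a serializer. Unlike the script above, which writes repeated tags such as phoneNumber as duplicate keys, this variant collects them into a list. This is only a rough sketch and not part of the script: the function name `element_to_data` and the `CONTENT_KEY` constant are made up for illustration, and it assumes attribute names never clash with child tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical key for text that sits next to attributes or child elements.
CONTENT_KEY = '_xml_node_content'


def element_to_data(elem):
    """Convert an Element into plain str / dict / list values."""
    text = (elem.text or '').strip()
    children = list(elem)
    if not children and not elem.attrib:
        return text  # leaf element: just its text
    data = dict(elem.attrib)  # attributes become ordinary keys
    if text:
        data[CONTENT_KEY] = text
    for child in children:
        value = element_to_data(child)
        if child.tag in data:
            # Repeated tag: collect all values in a list.
            # (Assumes no attribute shares its name with a child tag.)
            if not isinstance(data[child.tag], list):
                data[child.tag] = [data[child.tag]]
            data[child.tag].append(value)
        else:
            data[child.tag] = value
    return data


xml_text = (
    '<phoneNumbers>'
    '<phoneNumber type="home">212 555-1234</phoneNumber>'
    '<phoneNumber type="fax">646 555-4567</phoneNumber>'
    '</phoneNumbers>'
)
root = ET.fromstring(xml_text)
result = {root.tag: element_to_data(root)}
```

Because `result` holds only dicts, lists and strings, it should be possible to hand it to PyYAML with `yaml.safe_dump(result, default_flow_style=False)` without getting any `!!python/...` tags. Note that every value stays a string, so something like an age of 25 would be emitted as a quoted string unless you convert it yourself.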