How to create a <!DOCTYPE> with Python's cElementTree

17 votes
4 answers
20451 views
Asked 2025-04-17 10:16

I tried to use the answer from this question, but I can't get it to work: How to create a "virtual root" with Python's ElementTree?

Here is my code:

import xml.etree.cElementTree as ElementTree
from StringIO import StringIO
s = '<?xml version=\"1.0\" encoding=\"UTF-8\" ?><!DOCTYPE tmx SYSTEM \"tmx14a.dtd\" ><tmx version=\"1.4a\" />'
tree = ElementTree.parse(StringIO(s)).getroot()
header = ElementTree.SubElement(tree,'header',{'adminlang': 'EN',})
body = ElementTree.SubElement(tree,'body')
ElementTree.ElementTree(tree).write('myfile.tmx','UTF-8')

When I open the resulting 'myfile.tmx' file, it contains this:

<?xml version='1.0' encoding='UTF-8'?>
<tmx version="1.4a"><header adminlang="EN" /><body /></tmx>

What am I missing? Or is there a better tool?

4 Answers

2

I used a different approach to add the DOCTYPE. It is simple, and a bit dumb.

import xml.etree.ElementTree as ET

with open(path_file, "w", encoding='UTF-8') as xf:
    doc_type = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE dlg:window ' \
               'PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "dialog.dtd">'
    tostring = ET.tostring(root).decode('utf-8')
    file = f"{doc_type}{tostring}"
    xf.write(file)
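
The snippet above relies on `path_file` and `root` being defined elsewhere. A self-contained sketch of the same string-concatenation idea, assuming Python 3 and reusing the TMX element from the question, might look like this. The helper name `serialize_with_doctype` is hypothetical and is not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def serialize_with_doctype(root, doctype):
    """Hypothetical helper: return the XML declaration, the given DOCTYPE
    and the serialized element as a single string."""
    declaration = '<?xml version="1.0" encoding="UTF-8"?>'
    # encoding="unicode" makes tostring() return str instead of bytes
    # and emit no declaration of its own.
    body = ET.tostring(root, encoding="unicode")
    return "\n".join([declaration, doctype, body])


# Build the same small tree as in the question.
root = ET.Element("tmx", {"version": "1.4a"})
ET.SubElement(root, "header", {"adminlang": "EN"})
ET.SubElement(root, "body")

document = serialize_with_doctype(root, '<!DOCTYPE tmx SYSTEM "tmx14a.dtd">')
print(document)
```

To save the result, write `document` to a file opened in text mode with `encoding="utf-8"`, so that the bytes on disk match the encoding named in the declaration.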
17

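You can set the xml_declaration argument of the write function to False, so that the output does not include the XML declaration with the encoding. You can then add whatever header you need by hand. In fact, if you set the encoding to 'utf-8' (lowercase), the XML declaration is not added either.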

import xml.etree.cElementTree as ElementTree

tree = ElementTree.Element('tmx', {'version': '1.4a'})
ElementTree.SubElement(tree, 'header', {'adminlang': 'EN'})
ElementTree.SubElement(tree, 'body')

with open('myfile.tmx', 'wb') as f:
    f.write('<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE tmx SYSTEM "tmx14a.dtd">'.encode('utf8'))
    ElementTree.ElementTree(tree).write(f, 'utf-8')

The resulting file (line breaks added manually for readability):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
    <header adminlang="EN" />
    <body />
</tmx>
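
For readers on Python 3, where `xml.etree.cElementTree` was removed in 3.9, here is a minimal sketch of the same approach with `xml.etree.ElementTree` and an explicit `xml_declaration=False`. The `io.BytesIO` buffer is an assumption made only to keep the example self-contained. In real use it would be a file opened with `open('myfile.tmx', 'wb')`.

```python
import io
import xml.etree.ElementTree as ET

# Build the same tree as in the answer above.
root = ET.Element("tmx", {"version": "1.4a"})
ET.SubElement(root, "header", {"adminlang": "EN"})
ET.SubElement(root, "body")

# BytesIO stands in for a file opened in binary mode ('wb').
buffer = io.BytesIO()

# Write the header lines by hand, since ElementTree does not emit a DOCTYPE.
buffer.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
buffer.write(b'<!DOCTYPE tmx SYSTEM "tmx14a.dtd">\n')

# xml_declaration=False explicitly suppresses ElementTree's own declaration,
# so the hand-written one above is the only one in the output.
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=False)

output = buffer.getvalue()
print(output.decode("utf-8"))
```

Passing `xml_declaration=False` explicitly avoids depending on how a given Python version treats the spelling of the encoding name.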
13

You can use the lxml library and its tostring function:

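from lxml import etree

# Bytes literal: on Python 3, lxml rejects str input that carries an encoding declaration
s = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4a"/>"""

tree = etree.fromstring(s)
header = etree.SubElement(tree, 'header', {'adminlang': 'EN'})
body = etree.SubElement(tree, 'body')

# tostring() returns bytes when an encoding is given, so decode before printing
print(etree.tostring(tree, encoding="UTF-8",
                     xml_declaration=True,
                     pretty_print=True,
                     doctype='<!DOCTYPE tmx SYSTEM "tmx14a.dtd">').decode("UTF-8"))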

=>

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
  <header adminlang="EN"/>
  <body/>
</tmx>
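
If installing a third-party package is not an option, the standard library's `xml.dom.minidom` can also attach a DOCTYPE to a document. The following is only a sketch of that alternative, assuming Python 3 and the same TMX structure as in the question. It is not taken from any of the answers here.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

# createDocumentType(qualifiedName, publicId, systemId):
# no public identifier, only the system identifier of the DTD.
doctype = impl.createDocumentType("tmx", None, "tmx14a.dtd")

# createDocument(namespaceURI, qualifiedName, doctype) creates the root
# element and attaches the DOCTYPE to the document.
doc = impl.createDocument(None, "tmx", doctype)

root = doc.documentElement
root.setAttribute("version", "1.4a")

header = doc.createElement("header")
header.setAttribute("adminlang", "EN")
root.appendChild(header)
root.appendChild(doc.createElement("body"))

# With an encoding given, toxml() returns bytes that include the
# XML declaration and the DOCTYPE.
xml_bytes = doc.toxml(encoding="UTF-8")
print(xml_bytes.decode("UTF-8"))
```

The quoting and spacing of the DOCTYPE line are chosen by minidom's serializer, so the output may not match the hand-written header character for character.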
