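I recently realized that XML containing HTML markup in the body text of certain tags seems to choke parsers like WP.
So to mitigate this, I tried writing a Python script that outputs the XML properly.
It starts from the following XML file (this is just an excerpt):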
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
</Row>
...
</Root>
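Unfortunately, the output I actually get contains strange escaped characters (for example, "&lt;" in place of "<"). The desired output is: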
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
</Row>
...
</Root>
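So I would like to solve the following:
1) Output a new XML file that preserves the text as is, including the HTML in the newly introduced "Introduction_Body" tag, as well as in any other tag such as "Waterfall_Name".
2) Is it possible to pretty-print this (for human readability)? How?
My Python code currently looks like this: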
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
        intro_tree = directory + introduction
        with open(intro_tree, 'r') as f:  # note: this with statement eliminates the need for f.close()
            intro_text = f.read()
        intro_body = ET.SubElement(element, 'Introduction_Body')
        intro_body.text = '<![CDATA[' + intro_text + ']]>'

#tree.write('new_' + data_file)  # same result, but leaves out the XML header
f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' + ET.tostring(root))
f.close()
Thanks, Johnny
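I suggest you switch to lxml. It is well documented and (almost) fully compatible with Python's own xml module, so you may only need minimal changes to your code. lxml also supports CDATA very conveniently. Beyond that, you should definitely use a library not only for parsing XML but also for writing it!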
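As a rough sketch of the CDATA support mentioned above: wrapping a string in etree.CDATA before assigning it to .text makes lxml emit a real CDATA section instead of escaping the markup. The intro_text string here is a made-up stand-in for the contents of your HTML file, and lxml has to be installed separately since it is not part of the standard library.

```python
from lxml import etree

# Hypothetical stand-in for the text read from the HTML introduction file.
intro_text = (
    'See <a href="http://blah.com/blah.html">this</a><br>\n'
    '德天瀑布 [Détiān Pùbù]'
)

row = etree.Element('Row')
intro_body = etree.SubElement(row, 'Introduction_Body')

# etree.CDATA marks the text so that it is serialized as <![CDATA[...]]>
# rather than having '<' and '>' escaped as entities.
intro_body.text = etree.CDATA(intro_text)

# encoding='unicode' returns a str instead of bytes.
xml = etree.tostring(row, encoding='unicode')
print(xml)
```

Keep in mind that a CDATA section cannot contain the sequence ]]>, so an HTML file that happens to include it would need special handling.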
lxml will also write the XML declaration for you.
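A minimal sketch of letting lxml write the declaration, which also covers the pretty-printing part of the question. The tiny tree below is built by hand purely for illustration; xml_declaration, encoding, standalone and pretty_print are keyword arguments of etree.tostring, and tree.write accepts them as well.

```python
from lxml import etree

# Small hand-built tree, standing in for the parsed waterfall file.
root = etree.Element('Root')
row = etree.SubElement(root, 'Row')
etree.SubElement(row, 'Entry_No').text = '657'
etree.SubElement(row, 'Waterfall_Name').text = (
    'Detian Waterfall (德天瀑布 [Détiān Pùbù])'
)

# lxml generates the <?xml ...?> header itself, so there is no need to
# concatenate it by hand. With an explicit byte encoding the result is bytes.
output = etree.tostring(
    root,
    xml_declaration=True,
    encoding='UTF-8',
    standalone=True,
    pretty_print=True,  # indent nested elements for readability
)
print(output.decode('utf-8'))
```

When re-indenting a file parsed from disk, pretty_print tends to leave existing whitespace alone. Parsing with etree.XMLParser(remove_blank_text=True) is the usual workaround, though I have not tried it against your file.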