Converting an ISO-8859-2 XML file to UTF-8 in Python
I need your help with an encoding problem that seems a bit complicated.
I have many input files, all formatted roughly like this one:
<?xml version='1.0' encoding='iso-8859-1'?>
<root>
<Module name="ModuleName">
<Parameter Value="Data01$|Data02F1F5$|Data03:$|Data04 : $|"/>
</Module>
</root>
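I need to be able to parse this file, but it contains many special characters, as you can see below:
I cannot use tools such as lxml or Beautiful Soup.
I tried the different approaches below, but I still cannot find a solution: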
from xml.etree import ElementTree

file = 'StackOverflow.xml'

# Attempt 1: read the text as ISO-8859-1 and rewrite it as UTF-8
with open(file, 'r', encoding='iso-8859-1') as f:
    string = f.read()
print(string)
with open(file, 'w', encoding='utf-8') as f:
    f.write(string)

# Attempt 2: parse the raw bytes and let ElementTree write UTF-8
with open(file, 'rb') as f:
    root = ElementTree.fromstring(f.read())
tree = ElementTree.ElementTree(root)
tree.write(file, encoding='utf-8', xml_declaration=True)

# Attempt 3: force the parser encoding, then re-encode the serialized output
with open(file, 'rb') as f:
    parser = ElementTree.XMLParser(encoding="iso-8859-1")
    tree = ElementTree.parse(f, parser)
string = ElementTree.tostring(tree.getroot(), xml_declaration=True, encoding="utf-8").decode('utf-8').encode('iso-8859-1')
with open(file, 'wb') as f:
    f.write(string)
1 Answer
I cannot reproduce your problem:
import xml.etree.ElementTree as ET
xml_file_path = "StackOverFlow.xml"
tree = ET.parse(xml_file_path)
root = tree.getroot()
for elem in root.iter():
    print(elem.tag, elem.attrib)
Output:
root {}
Module {'name': 'ModuleName'}
Parameter {'Value': 'Data01$|Data02F1F5$|Data03:$|Data04 : $|'}
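The sample above only contains ASCII characters, so it cannot show an encoding problem by itself. As a minimal sketch, assuming a made-up attribute value that contains a byte outside ASCII (`\xe9`, which is "é" in ISO-8859-1), the parser still decodes it correctly as long as it receives the raw bytes together with the XML declaration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: b"\xe9" is "é" encoded as ISO-8859-1.
raw = (
    b"<?xml version='1.0' encoding='iso-8859-1'?>"
    b"<root><Parameter Value='caf\xe9$|'/></root>"
)

# Passing bytes lets the parser pick the encoding from the declaration.
root = ET.fromstring(raw)
value = root.find("Parameter").get("Value")
print(value)  # prints: café$|
```

Once parsed, `value` is an ordinary Python `str`, so the original file encoding no longer matters.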
Your image shows utf-8:
import xml.etree.ElementTree as ET
# Even with utf-8 it works:
xml_str = """<?xml version='1.0' encoding='utf-8'?>
<root>
<Module name="ModuleName">
<Parameter Value="Data01$|Data02F1F5$|Data03:$|Data04 : $|" />
</Module>
</root>"""
root = ET.fromstring(xml_str)
for elem in root.iter():
    print(elem.tag, elem.attrib)
The output is fine here as well:
root {}
Module {'name': 'ModuleName'}
Parameter {'Value': 'Data01$|Data02F1F5$|Data03:$|Data04 : $|'}
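If the goal is to convert the files rather than only read them, something along these lines might be enough. It is only a sketch: `convert_to_utf8` and its two path parameters are names I made up, and it assumes the declaration in each input file matches the file's real encoding. One possible reason your attempts fail is that the first one rewrites the bytes as UTF-8 while the declaration still says iso-8859-1, so every later parse of that same file decodes it with the wrong encoding.

```python
import xml.etree.ElementTree as ET


def convert_to_utf8(src_path, dst_path):
    """Re-encode an XML file as UTF-8 (hypothetical helper)."""
    # ET.parse opens the file in binary mode, so the parser takes the
    # source encoding from the XML declaration inside the file.
    tree = ET.parse(src_path)
    # Writing with encoding="utf-8" re-encodes the text, and
    # xml_declaration=True emits a declaration that matches it.
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
```

Calling `convert_to_utf8("StackOverflow.xml", "StackOverflow_utf8.xml")` writes to a separate output file (the second name is just an example), which keeps the original untouched while you check the result.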