Converting an ISO-8859-2 XML file to UTF-8 in Python

Asked 2025-04-12 18:19

I need help with an encoding problem that looks somewhat complicated.

I have many input files that all look roughly like this:

<?xml version='1.0' encoding='iso-8859-1'?>
  <root>
    <Module name="ModuleName">
      <Parameter Value="Data01$|Data02F1F5$|Data03:$|Data04 : $|"/>
    </Module>
  </root>

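I need to be able to parse this file, but it contains many special characters, as shown in this screenshot:

[Image: screenshot of the special characters in the file]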

I can't use lxml or Beautiful Soup.

I tried the different approaches below, but I still can't find a solution:

from xml.etree import ElementTree

file = 'StackOverflow.xml'

# Attempt 1: read the text as ISO-8859-1 and write it back as UTF-8
with open(file, 'r', encoding='iso-8859-1') as f:
    string = f.read()
    print(string)
with open(file, 'w', encoding='utf-8') as f:
    f.write(string)

# Attempt 2: parse the raw bytes and let ElementTree write UTF-8
with open(file, 'rb') as f:
    root = ElementTree.fromstring(f.read())

tree = ElementTree.ElementTree(root)
tree.write(file, encoding='utf-8', xml_declaration=True)

# Attempt 3: parse with an explicit parser encoding, then re-encode the output
with open(file, 'rb') as f:
    parser = ElementTree.XMLParser(encoding='iso-8859-1')
    tree = ElementTree.parse(f, parser)

string = ElementTree.tostring(tree.getroot(), xml_declaration=True, encoding='utf-8').decode('utf-8').encode('iso-8859-1')

with open(file, 'wb') as f:
    f.write(string)

1 Answer

I can't reproduce the problem you describe:

import xml.etree.ElementTree as ET

xml_file_path = "StackOverflow.xml"  # same file name as in the question

tree = ET.parse(xml_file_path)
root = tree.getroot()

for elem in root.iter():
    print(elem.tag, elem.attrib)

Output:

root {}
Module {'name': 'ModuleName'}
Parameter {'Value': 'Data01$|Data02F1F5$|Data03:$|Data04 : $|'}
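
A likely reason the plain parse works is that, when ElementTree is handed a file path or raw bytes, the parser reads the encoding from the XML declaration itself. The following is a minimal sketch of that behaviour, not the asker's actual data: the "é" in the attribute value is a hypothetical addition so that the input contains a byte that differs between ISO-8859-1 and UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: same shape as the question's file, plus an "é",
# which is the single byte 0xE9 in ISO-8859-1.
xml_text = (
    "<?xml version='1.0' encoding='iso-8859-1'?>\n"
    "<root><Module name='ModuleName'>"
    "<Parameter Value='Data01$|caf\u00e9$|'/>"
    "</Module></root>"
)
raw = xml_text.encode("iso-8859-1")

# Passing bytes lets the parser honour the declared encoding.
root = ET.fromstring(raw)
value = root.find("./Module/Parameter").get("Value")
print(value)
```

Once parsed, the attribute value is an ordinary Python `str`, so the original byte encoding no longer matters for further processing.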

Your screenshot shows UTF-8:

import xml.etree.ElementTree as ET

# Even with UTF-8 it works:
xml_str = """<?xml version='1.0' encoding='utf-8'?>
<root>
  <Module name="ModuleName">
    <Parameter Value="Data01$|Data02F1F5$|Data03:$|Data04 : $|" />
  </Module>
</root>"""

root = ET.fromstring(xml_str)

for elem in root.iter():
    print(elem.tag, elem.attrib)

The output is fine here as well:

root {}
Module {'name': 'ModuleName'}
Parameter {'Value': 'Data01$|Data02F1F5$|Data03:$|Data04 : $|'}
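
For the conversion itself, a standard-library-only approach could look like the sketch below. It assumes the input file's declaration matches its real byte encoding; `convert_to_utf8`, the file names, and the sample content (including the "é") are hypothetical and only there for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def convert_to_utf8(src_path, dst_path):
    """Parse src_path using its declared encoding and write dst_path as UTF-8."""
    tree = ET.parse(src_path)  # encoding is taken from the XML declaration
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)


# Demo on a throwaway ISO-8859-1 file (hypothetical content, with an "é").
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input_latin1.xml")
dst = os.path.join(workdir, "output_utf8.xml")

with open(src, "wb") as f:
    f.write(
        "<?xml version='1.0' encoding='iso-8859-1'?>"
        "<root><Parameter Value='caf\u00e9$|'/></root>".encode("iso-8859-1")
    )

convert_to_utf8(src, dst)

with open(dst, "rb") as f:
    converted = f.read()

print(converted.decode("utf-8"))
```

Writing to a separate output path is a deliberate choice in this sketch: the source file stays untouched, so a failed or repeated run cannot leave it with bytes in one encoding and a declaration naming another.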
