ElementTree Unicode encode/decode error

Question

I am working on a project in which I need to enrich some XML files and save them to disk. However, I keep running into the following error:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 193, in parse_references
    outputXML = ET.tostring(root, encoding='utf8', method='xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
 ECLI:NL:RVS:2012:BY1564
 File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)

This error is raised by the following line:

outputXML = ET.tostring(root, encoding='utf8', method='xml')
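One possible cause, which the question does not confirm: on Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, so a UTF-8 byte string stored in an element's text or tail would fail at exactly the text.encode(encoding, "xmlcharrefreplace") call shown in the traceback. The sketch below assumes the enrichment step is what inserts such byte strings; the helper to_text and the element names are hypothetical and only illustrate normalizing values to Unicode before they enter the tree.

```python
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed.

    Hypothetical helper, not part of ElementTree.
    """
    if isinstance(value, bytes):  # on Python 2, `bytes` is an alias of `str`
        return value.decode(encoding)
    return value


# Hypothetical element names, used only for illustration.
root = ET.Element("uitspraak")
child = ET.SubElement(root, "titel")

# b"Belgi\xc3\xab" is the UTF-8 encoding of u"Belgi\xeb"; it is decoded
# to Unicode before being stored in the tree.
child.text = to_text(b"Belgi\xc3\xab")

# With only Unicode text in the tree, serialization returns UTF-8 bytes.
output_xml = ET.tostring(root, encoding="utf-8", method="xml")
```

If the tree really does contain byte strings, applying a conversion like this wherever text, tail, or attribute values are assigned should keep the serializer from triggering the implicit ASCII decode.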

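While looking for a solution, I found several suggestions to add .decode('utf-8') to the call, but that produced an encode error (where it had originally been a decode error), so that did not work either.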

The encode error is:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 197, in parse_references
    myfile.write(outputXML)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 13559: ordinal not in range(128)

It is raised by the following line:

outputXML = ET.tostring(root, encoding='utf8', method='xml').decode('utf-8')

The source code (or at least the relevant part):

# URL encodes the parameters
encoded_parameters = urllib.urlencode({'id':ecli})

# Opens XML file
feed = urllib2.urlopen("http://data.rechtspraak.nl/uitspraken/content?"+encoded_parameters, timeout = 3)

# Parses the XML
ecliFile = ET.parse(feed)

# Fetches root element of current tree
root = ecliFile.getroot()

# Write the XML to a file without any extra indents or newlines
outputXML = ET.tostring(root, encoding='utf8', method='xml')

# Write the XML to the file
with open(file, "w") as myfile:
    myfile.write(outputXML)

Finally, here is a link to a sample XML file: http://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2012:BY1542
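The second traceback is consistent with a Unicode string being written to a file opened in text mode, which on Python 2 encodes it implicitly with ASCII. A minimal sketch of one way around that, assuming the goal is simply a UTF-8 file on disk: keep the bytes returned by ET.tostring as they are and write them in binary mode. The function name save_xml, the element name, and the temporary path are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_xml(root, path):
    """Serialize root to UTF-8 bytes and write them unchanged in binary mode."""
    data = ET.tostring(root, encoding="utf-8", method="xml")
    with open(path, "wb") as handle:  # "wb": no implicit text encoding step
        handle.write(data)


# Hypothetical element containing a non-ASCII character (u"\xeb").
root = ET.Element("uitspraak")
root.text = u"Belgi\xeb"

path = os.path.join(tempfile.mkdtemp(), "out.xml")
save_xml(root, path)

# Reading the file back confirms the text survived the round trip.
reparsed = ET.parse(path).getroot()
```

Because the data stays as bytes from serialization through to the write, no .decode('utf-8') call is involved and no ASCII conversion happens along the way.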

