ElementTree Unicode encode/decode error
I'm working on a project in which I need to enrich some XML files and save them to disk. However, I keep running into the following error:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 193, in parse_references
outputXML = ET.tostring(root, encoding='utf8', method='xml')
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
ECLI:NL:RVS:2012:BY1564
File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
write(_escape_cdata(text, encoding))
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)
This error is produced by the following line:
outputXML = ET.tostring(root, encoding='utf8', method='xml')
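While looking for a solution, I found several suggestions saying I should add .decode('utf-8') to the call, but that results in an encode error from the write call (instead of the original decode error), so that doesn't work either...
The encode error is: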
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 197, in parse_references
myfile.write(outputXML)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 13559: ordinal not in range(128)
It is produced by the following line:
outputXML = ET.tostring(root, encoding='utf8', method='xml').decode('utf-8')
The source code (or at least the relevant part):
# URL encodes the parameters
encoded_parameters = urllib.urlencode({'id':ecli})
# Opens XML file
feed = urllib2.urlopen("http://data.rechtspraak.nl/uitspraken/content?"+encoded_parameters, timeout = 3)
# Parses the XML
ecliFile = ET.parse(feed)
# Fetches root element of current tree
root = ecliFile.getroot()
# Write the XML to a file without any extra indents or newlines
outputXML = ET.tostring(root, encoding='utf8', method='xml')
# Write the XML to the file
with open(file, "w") as myfile:
    myfile.write(outputXML)
Finally, a link to a sample of the XML: http://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2012:BY1542
1 Answer
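The error is caused by a byte string value.
The text in the traceback is supposed to be a Unicode value, but if it is a plain byte string, Python 2 will first implicitly decode it to Unicode with the ASCII codec, just so that it can then be encoded again.
It is that implicit decoding step that fails.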
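As a rough illustration of that implicit step (a sketch, not ElementTree's actual code): in Python 2, calling .encode() on a byte string behaves roughly like decoding it as ASCII first and then encoding the result. The sample value below is made up; writing the two steps out explicitly reproduces the same kind of failure.

```python
# A byte string holding UTF-8 encoded data (hypothetical sample value).
raw = u'\xeb'.encode('utf-8')  # two bytes: 0xc3 0xab

failed = False
message = ''
try:
    # Spelled-out version of what Python 2 does implicitly for
    # raw.encode('utf-8'): decode as ASCII first, then encode.
    raw.decode('ascii').encode('utf-8')
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(message)
```

The first byte, 0xc3, is outside the ASCII range, so the decode step raises before any encoding happens.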
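Because you didn't show us what you insert into the XML tree, it is hard to tell you exactly what to fix, other than to make sure you always use Unicode values when inserting text.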
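A minimal sketch of that advice, assuming the inserted data arrives as UTF-8 encoded byte strings. The helper name to_text, the element and attribute names, and the sample value are all hypothetical, since the question does not show the inserting code.

```python
import xml.etree.ElementTree as ET


def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding byte strings first.

    The default encoding is an assumption; use whatever encoding
    the incoming data really has.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


root = ET.Element('root')

# Hypothetical input: UTF-8 bytes containing a non-ASCII character.
raw = u'ge\xebist'.encode('utf-8')

# Decode once at the boundary, so the tree only ever holds Unicode.
root.set('note', to_text(raw))
child = ET.SubElement(root, 'child')
child.text = to_text(raw)

output_xml = ET.tostring(root, encoding='utf-8', method='xml')
```

Decoding at the point of insertion keeps the tree free of byte strings, so serialization never has to guess an encoding.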
Demo:
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'.encode('utf8')
>>> ET.tostring(root, encoding='utf8', method='xml')
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
v = _escape_attrib(v, encoding)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 31: ordinal not in range(128)
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'
>>> ET.tostring(root, encoding='utf8', method='xml')
'<?xml version=\'1.0\' encoding=\'utf8\'?> ...'
Setting a byte string attribute that contains bytes outside the ASCII range triggers the exception; using a Unicode value instead ensures the result can be produced.
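The second traceback in the question comes from the write step: after .decode('utf-8') the result is a Unicode string, and a Python 2 file opened with open(file, "w") falls back to ASCII when asked to write it. Since ET.tostring() with an encoding already returns an encoded byte string, one option is to write it unchanged in binary mode. The sketch below assumes a tree that holds only Unicode values; the element name, the sample text, and the temporary output path are made up.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the parsed document.
root = ET.Element('uitspraak')
root.text = u'ge\xebist'  # Unicode text with a non-ASCII character

# With an encoding given, tostring() returns encoded bytes,
# so no further .decode() or .encode() is needed.
output_xml = ET.tostring(root, encoding='utf-8', method='xml')

path = os.path.join(tempfile.mkdtemp(), 'output.xml')

# Binary mode writes the bytes as they are, with no implicit ASCII step.
with open(path, 'wb') as handle:
    handle.write(output_xml)

# Alternative: let ElementTree serialize and write in one call.
ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)
```

Either form leaves the encoding decision to ElementTree, which is usually simpler than converting the serialized string by hand.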