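<p>I'll start with the question: "Is there an alternative parser I can use that may be less strict and allow utf-8 characters?"</p>
<p>Every XML parser will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.</p>
<p>An XML document may start with a declaration like this:</p>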
<pre><code><?xml version="1.0" encoding="UTF-8"?>
</code></pre>
<p>or like this:
<code><?xml version="1.0"?></code>
or have no declaration at all. In each case the parser will decode the document as UTF-8.</p>
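<p>A quick way to check this claim is sketched below in Python 3 with the standard-library <code>xml.etree.ElementTree</code>. The sample document and the variable names are illustrative assumptions, not something taken from the question:</p>

```python
import xml.etree.ElementTree as ET

# Illustrative document body: the text contains U+2019, encoded as UTF-8 bytes.
utf8_body = "<root>can\u2019t</root>".encode("utf-8")

# The three forms discussed above: explicit UTF-8 declaration,
# declaration without an encoding, and no declaration at all.
variants = [
    b'<?xml version="1.0" encoding="UTF-8"?>' + utf8_body,
    b'<?xml version="1.0"?>' + utf8_body,
    utf8_body,
]

# Each variant should parse, and each should give back the same text.
texts = [ET.fromstring(doc).text for doc in variants]
```

<p>If the parser really treats UTF-8 as the default, all three entries in <code>texts</code> come out identical.</p>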
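<p>But your data is not encoded in UTF-8. It is probably Windows-1252, also known as cp1252.</p>
<p>If the encoding is not UTF-8, then either the creator should include a declaration naming the actual encoding (or the receiver can prepend one), or the receiver can transcode the data to UTF-8. The following Python 2 session shows what works and what doesn't:</p>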
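<p>One rough way to test that guess before touching the parser is to try decoding the raw bytes yourself. This is a minimal Python 3 sketch; <code>is_valid_utf8</code> and the sample bytes are hypothetical names and data introduced only for illustration:</p>

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Hypothetical sample: 0x92 is a curly apostrophe in cp1252,
# but it is not a valid byte at this position in UTF-8.
sample = b"<root>can\x92t</root>"

valid = is_valid_utf8(sample)

# Assumption: the data is cp1252. A failed UTF-8 check does not prove that.
as_cp1252 = sample.decode("cp1252")
```

<p>A failed check only tells you the data is not UTF-8; picking cp1252 as the real encoding is still a guess that should be confirmed by looking at the decoded text.</p>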
<pre><code>>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio
>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration
>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8
>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again
>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works
>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception
>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8
>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
</code></pre>
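<p>The session above is Python 2. Here is a rough Python 3 equivalent of the same two fixes, assuming the input arrives as <code>bytes</code> and really is cp1252. The helper names <code>parse_with_declared_encoding</code> and <code>parse_after_transcoding</code> are made up for this sketch:</p>

```python
import xml.etree.ElementTree as ET

# cp1252-encoded bytes with no XML declaration, as in the session above.
raw_bytes = b"<root>can\x92t</root>"


def parse_with_declared_encoding(data, encoding="cp1252"):
    """Prepend an XML declaration naming the assumed encoding, then parse."""
    declaration = ('<?xml version="1.0" encoding="%s"?>' % encoding).encode("ascii")
    return ET.fromstring(declaration + data)


def parse_after_transcoding(data, encoding="cp1252"):
    """Decode from the assumed encoding and re-encode as UTF-8 before parsing."""
    return ET.fromstring(data.decode(encoding).encode("utf-8"))


# Without help, the parser expects UTF-8 and rejects the 0x92 byte.
try:
    ET.fromstring(raw_bytes)
    parsed_as_utf8 = True
except ET.ParseError:
    parsed_as_utf8 = False

text_declared = parse_with_declared_encoding(raw_bytes).text
text_transcoded = parse_after_transcoding(raw_bytes).text
```

<p>Both helpers should yield the same text, containing U+2019 where the original byte <code>0x92</code> was. Prepending a declaration only makes sense when the data has none of its own; if it already carries one, transcoding and fixing the existing declaration is the safer route.</p>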