Answer to the question: Alternative XML parser for ElementTree to ease UTF-8 woes?

1 answer

Anonymous, 1 day ago

I'll start with the question: "Is there an alternative parser I could use that might be less strict and allow UTF-8 characters?"

All XML parsers accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.

An XML document may start with a declaration like this:

<pre><code><?xml version="1.0" encoding="UTF-8"?>
</code></pre>

or like this: <code><?xml version="1.0"?></code>, or have no declaration at all. In each of these cases, the parser will decode the document as UTF-8.

However, your data is not encoded in UTF-8. It is probably Windows-1252, also known as cp1252.

If the encoding is not UTF-8, then either the creator should include a declaration naming the real encoding (or the recipient can prepend one), or the recipient can transcode the data to UTF-8. The following Python 2 session shows what works and what doesn't:

<pre><code>>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio

>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration

>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8

>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again

>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception

>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8

>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
</code></pre>
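The session above relies on Python 2 features (the `StringIO` module and `str.decode`). For readers on Python 3, here is a minimal sketch of the same two fixes. It assumes the input arrives as `bytes` and that the real encoding is cp1252, which the answer only describes as probable. The helper names `parse_declared`, `parse_transcoded`, and `parses_as_utf8` are hypothetical and are not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Same sample as in the session above: cp1252 bytes, no XML declaration.
RAW = b"<root>can\x92t</root>"


def parse_declared(data: bytes, encoding: str) -> ET.Element:
    """Prepend an XML declaration naming the real encoding, then parse.

    Hypothetical helper; assumes `data` has no declaration of its own.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'.encode("ascii")
    return ET.fromstring(declaration + data)


def parse_transcoded(data: bytes, encoding: str) -> ET.Element:
    """Transcode the bytes to UTF-8 (the XML default), then parse.

    Hypothetical helper; assumes `encoding` is the true source encoding.
    """
    return ET.fromstring(data.decode(encoding).encode("utf-8"))


def parses_as_utf8(data: bytes) -> bool:
    """Return True if the parser accepts the bytes under its UTF-8 default."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


print(parses_as_utf8(RAW))                          # undeclared cp1252 input
print(repr(parse_declared(RAW, "cp1252").text))     # fix 1: declare the encoding
print(repr(parse_transcoded(RAW, "cp1252").text))   # fix 2: transcode to UTF-8
```

Both fixes depend on knowing the source encoding. If cp1252 is the wrong guess, parsing may still succeed while the text comes out with the wrong characters, so it is worth checking a sample of the decoded output.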
