Why does printing to a UTF-8 file fail?

Asked on 2025-04-16 20:33

I ran into a problem this afternoon. I managed to fix it, but I still don't quite understand why the fix works.

Basically, the following code does not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following error:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

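But if I don't open the new file with codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.

So what is going on?

- Since encoding='utf-8' is specified in etree.tostring(), is the output being encoded a second time when it is written to the file?
- If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Presumably because etree.tostring() defaults to ASCII?
- But isn't the output of etree.tostring() simply written to standard output, which is then redirected to a file that was created as a UTF-8 file?
- Is print>> not behaving the way I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is actually happening here? It may seem trivial, but I'm clearly a bit confused and would like to understand why my fix works.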


3 Answers

In addition to MRAB's answer, here is some example code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))

Answered on 2025-04-16 by Python大师

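Writing with print>>outFile is a bit odd. I don't have lxml installed, but the built-in xml.etree library is similar (although it doesn't support pretty_print). You can wrap the root element in an ElementTree and use its write method.

Also, if you add a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes: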

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')

Answered on 2025-04-16 by Python大师

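You opened the file with UTF-8 encoding, which means it expects Unicode strings.

tostring encodes its output to UTF-8 (that is, a byte string, str), and you are writing that to the file.

Because the file expects Unicode, it first decodes the byte string to Unicode using the default ASCII encoding, so that it can then encode the Unicode to UTF-8.

Unfortunately, the byte string is not ASCII.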
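To make the hidden step visible, here is a minimal sketch that spells out by hand what the Python 2 codecs writer is described as doing above. The function name implicit_write and the sample payload are hypothetical stand-ins for illustration; nothing here calls lxml, and the sketch is written so that it should run on both Python 2.7 and Python 3.

```python
# Stand-in for the UTF-8 *bytes* that etree.tostring(..., encoding='utf-8')
# returns (hypothetical payload, built by hand rather than by lxml).
xml_bytes = u'<title>\u041c\u041e\u0421\u041a\u0412\u0410</title>'.encode('utf-8')


def implicit_write(data):
    """Mimic the two steps described above for a byte string:
    decode with the default ASCII codec, then encode to UTF-8."""
    text = data.decode('ascii')   # the hidden step; fails on any non-ASCII byte
    return text.encode('utf-8')


# Pure-ASCII bytes survive the round trip, which is why the problem
# only shows up once non-ASCII characters are involved.
ascii_result = implicit_write(b'<root/>')

try:
    implicit_write(xml_bytes)
    error = None
except UnicodeDecodeError as exc:
    error = exc   # byte 0xd0 is the first byte of the UTF-8 encoding of U+041C

print(error)
```

The failure comes from the ASCII decode, not from the UTF-8 encode, which matches the UnicodeDecodeError in the question's traceback.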

Addendum: the best advice for avoiding this kind of problem is to use Unicode internally, decoding on input and encoding on output.

Answered on 2025-04-16 by Python大师
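A small sketch of that advice, under the assumption that the input arrives as UTF-8 bytes. The sample bytes, the variable names, and the temporary output path are all made up for illustration; io.open is used because it behaves the same way on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

raw = b'R\xc3\xa9sum\xc3\xa9'      # bytes arriving from outside (assumed UTF-8)
text = raw.decode('utf-8')         # decode once, at the input boundary

# Work on Unicode text internally: case checks and conversions behave
# as expected for accented characters.
is_title = text.isupper()
shouted = text.upper()

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Encode once, at the output boundary: the file object does the encoding,
# so it must be handed text, never already-encoded bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(shouted)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    written = f.read()
```

With this layout the encoding happens in exactly one place, so there is no byte string for a writer to decode implicitly.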
