Python, UnicodeDecodeError

8 votes, 8 answers, 44204 views

Asked 2025-04-15 16:11

I am getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

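I tried setting many different encodings (in the file header, such as # -*- coding: utf8 -*-), and I even used u"string", but the error still shows up.

How do I fix this?

Edit: I don't know exactly which character causes the error, but since this is a program that explores folders recursively, it must have hit a file with a strange character in its name.

Code: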

# -*- coding: utf8 -*-


# by TerabyteST

###########################

# Explores given path recursively
# and finds file which size is bigger than the set treshold

import sys
import os

class Explore():
    def __init__(self):
        self._filelist = []

    def exploreRec(self, folder, treshold):
        print folder
        generator = os.walk(folder + "/")
        try:
            content = generator.next()
        except:
            return
        folders = content[1]
        files = content[2]
        for n in folders:
            if "$" in n:
                folders.remove(n)
        for f in folders:
            self.exploreRec(u"%s/%s"%(folder, f), treshold)
        for f in files:
            try:
                rawsize = os.path.getsize(u"%s/%s"%(folder, f))
            except:
                print "Error reading file %s"%u"%s/%s"%(folder, f)
                continue
            mbsize = rawsize / (1024 * 1024.0)
            if mbsize >= treshold:
                print "File %s is %d MBs!"%(u"%s/%s"%(folder, f), mbsize)

Traceback:

Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    a.exploreRec("C:", 100)
  File "D:/Python/Explorator/shitfinder.py", line 35, in exploreRec
    print "Error reading file %s"%u"%s/%s"%(folder, f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

This is what is displayed when using print repr("Error reading file %s"%u"%s/%s"%(folder.decode('utf-8','ignore'), f.decode('utf-8','ignore')))

>>> a = Explore()
>>> a.exploreRec("C:", 100)
File C:/Program Files/Ableton/Live 8.0.4/Resources/DefaultPackages/Live8Library_v8.2.alp is 258 MBs!
File C:/Program Files/Adobe/Reader 9.0/Setup Files/{AC76BA86-7AD7-1040-7B44-A90000000001}/Data1.cab is 114 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/Art1.bar is 393 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/art2.bar is 396 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/art3.bar is 228 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/Sound/Sound.bar is 273 MBs!
File C:/ProgramData/Microsoft/Search/Data/Applications/Windows/Windows.edb is 162 MBs!
REPR:
u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror's Edge.lnk"
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror's Edge.lnk
REPR:
u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror's Edge.lnk"
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror's Edge.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    a.exploreRec("C:", 100)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
  File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
    self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x99 in position 78: ordinal not in range(128)
>>>


8 Answers

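You are trying to do something (such as printing) with a unicode string that contains non-ASCII characters, and the string is being converted to ASCII by default. You need to specify an encoding for the string to be displayed correctly.
It would help if you posted a sample of the code you are trying.

The simplest way is:
s = u'ma\xf1ana'
print s.encode('latin-1')

Edit after details were added to the question:

In your case, you need to decode the strings you read first:
f.decode()
So try changing
u"%s/%s" % (folder, f)
to
os.path.join(folder, f.decode())

Note that 'latin-1' may need to be changed to match how your file names are encoded. Whichever codec applies, pass it to decode() explicitly, because a bare decode() falls back to ASCII.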
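A minimal sketch of the explicit decode and encode steps described above. It is written for Python 3, where bytes plays the role of the Python 2 str; the file name and the choice of cp1252 are assumptions made for illustration, not values taken from the question.

```python
# Hypothetical file name as raw bytes, the way a cp1252 Windows system
# would report it (0xe0 is "a with grave accent" in cp1252).
raw_name = b"Modalit\xe0 provvisoria.lnk"

# Naming the codec avoids the implicit ASCII decode that raises
# UnicodeDecodeError on the first byte >= 0x80.
name = raw_name.decode("cp1252")

# Encoding is the reverse step, needed for byte-oriented output.
round_trip = name.encode("cp1252")

print(ascii(name))
```

If the codec is wrong the decode may still succeed and quietly give the wrong characters, so it is worth checking one known file name by eye.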

P.S. John Machin mentioned some very useful ways to improve and clean up the code. +1

Answered 2025-04-15 by Python大师


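Python defaults to ASCII, which is a bit annoying. If you want to change this permanently, find and edit the file site.py. Search it for def setencoding(), and a few lines below change encoding = "ascii" to encoding = "utf-8". That gets rid of the ASCII default.

Answered 2025-04-15 by Python大师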


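We can't guess what you are trying to do, what is in your code, what "setting many different codecs" means, or what u"string" was supposed to do for you.

Please change your code back to its initial state so that it reflects as closely as possible what you are trying to do, run it again, and then edit your question to provide: (1) the full traceback and error message that you get (2) a snippet that includes the last statement shown in the traceback (3) a brief description of what you want the code to do (4) the version of Python you are running.

Edit after details were added to the question:

(0) Let's try some transformations on the failing statement:

Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces to make it easier to read:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasise the order of evaluation:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in parentheses:
print u"Error reading file %s/%s" % (folder, f)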

Is that really what you intended? Suggestion: construct the path once, using a better method (see point (2) below).
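The transformation above can be checked directly: the % operator associates left to right, so the three spellings build the same string. This is a small sketch in Python 3 syntax with made-up folder and file names. In Python 2 the u"..." template is what forces byte-string arguments through the implicit ASCII decode, which is where the exception comes from.

```python
folder, f = "C:/data", "report.txt"  # hypothetical values

# The statement as written: % is applied left to right.
chained = "Error reading file %s" % u"%s/%s" % (folder, f)

# The same thing with the evaluation order made explicit.
grouped = ("Error reading file %s" % u"%s/%s") % (folder, f)

# The constant sub-expression evaluated by hand.
direct = u"Error reading file %s/%s" % (folder, f)

print(chained == grouped == direct)
```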

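(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way your diagnostic code is much less likely to raise an exception itself (as is happening here), and you avoid ambiguity. Insert the statement print repr(folder), repr(f) before you try to get the size, rerun, and report back.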
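A sketch of that kind of diagnostic helper, written for Python 3. There, ascii() is the closest equivalent of the Python 2 repr() recommended above, because it escapes every non-ASCII character. The helper name describe and the sample values are hypothetical.

```python
def describe(label, value):
    """Return an ASCII-only description of value, including its type."""
    # ascii() never produces a character the console cannot show, so the
    # diagnostic itself cannot fail with an encoding error.
    return "%s: %s (%s)" % (label, ascii(value), type(value).__name__)


print(describe("folder", "C:/ProgramData"))
print(describe("f", "Modalit\xe0 provvisoria.lnk"))   # text
print(describe("f", b"Modalit\xe0 provvisoria.lnk"))  # raw bytes
```

Showing the type next to the value answers the question that keeps coming up in this thread: whether folder and f are byte strings or unicode objects.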

(2) Don't build paths with u"%s/%s" % (folder, filename) ... use os.path.join(folder, filename)

(3) Don't use bare excepts; check for known problems. To keep unknown problems from staying unknown, do something like this:

try:
    some_code()
except ReasonForBaleOutError:
    continue
except: 
    # something's gone wrong, so get diagnostic info
    print repr(interesting_datum_1), repr(interesting_datum_2)
    # ... and get traceback and error message
    raise

A more sophisticated approach would use logging instead of printing, but the above is much better than not knowing what is going on.
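For the logging variant, something along these lines might do. It is only a sketch: the logger name "explore" and the helper size_or_none are made up for illustration, and it assumes OSError is the known problem worth skipping.

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("explore")  # hypothetical logger name


def size_or_none(path):
    """Return the size of path in bytes, or None if it cannot be read."""
    try:
        return os.path.getsize(path)
    except OSError:
        # log.exception records the message plus the full traceback;
        # %r keeps odd characters in the path from breaking the log call.
        log.exception("cannot get size of %r", path)
        return None
```

Anything other than OSError still propagates, so unexpected problems stay visible.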

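Further edits after rtm("os.walk"), remembering old legends, and re-reading your code:

(4) os.walk() walks over the whole tree; you don't need to call it recursively.

(5) If you pass a unicode string to os.walk(), the results (paths, file names) are reported as unicode. You don't need all that u"blah" stuff. Then you only have to choose how to display the unicode results.

(6) Removing paths that contain "$": you must modify the list in place, but your method is dangerous. Try something like this: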

for i in xrange(len(folders) - 1, -1, -1):
    if '$' in folders[i]:
        del folders[i]
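To see why removing items while iterating forward is dangerous, here is a small comparison in Python 3 syntax (range instead of xrange). The folder names are made up. The backward loop has to start at the last valid index, len(...) - 1.

```python
folders = ["$Recycle.Bin", "$WinREAgent", "Users"]  # hypothetical names

# Removing during forward iteration shifts the list under the iterator,
# so the element right after each removed one is never examined.
buggy = list(folders)
for name in buggy:
    if "$" in name:
        buggy.remove(name)

# Walking the indexes backwards keeps the not-yet-visited indexes valid.
fixed = list(folders)
for i in range(len(fixed) - 1, -1, -1):
    if "$" in fixed[i]:
        del fixed[i]

# Slice assignment is an equivalent in-place alternative.
sliced = list(folders)
sliced[:] = [name for name in sliced if "$" not in name]

print(buggy, fixed, sliced)
```

The in-place part matters because os.walk() only honours pruning done on the very list object it handed out.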

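(7) You refer to files by joining a folder name and a file name. You are using the ORIGINAL folder name; once you rip out the recursion, that won't work. You will need the content[0] value reported by os.walk, which you currently discard.

(8) You should find yourself using something very simple, like:

for folder, subfolders, filenames in os.walk(unicoded_top_folder):

There is no need for generator = os.walk(...); try: content = generator.next() and so on. By the way, if you ever do need generator.next(), use except StopIteration instead of a bare except.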
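Putting points (2) through (8) together, the whole explorer could shrink to roughly the following. This is a sketch for Python 3, not the original poster's code: the function name find_big_files, the generator design and the demo file names are all assumptions.

```python
import os
import shutil
import tempfile


def find_big_files(top, threshold_mb):
    """Yield (path, size_in_mb) for files under top of at least threshold_mb."""
    for folder, subfolders, filenames in os.walk(top):
        # Prune in place so os.walk() does not descend into "$" folders.
        subfolders[:] = [d for d in subfolders if "$" not in d]
        for filename in filenames:
            # Join with the folder os.walk() reports, not the top folder.
            path = os.path.join(folder, filename)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            size_mb = size / (1024 * 1024.0)
            if size_mb >= threshold_mb:
                yield path, size_mb


# Usage on a throwaway tree (file and folder names are made up).
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "$skipped"))
for relative in ("big.bin", os.path.join("$skipped", "hidden.bin")):
    with open(os.path.join(demo, relative), "wb") as handle:
        handle.write(b"\0" * (2 * 1024 * 1024))
found = [os.path.basename(path) for path, _ in find_big_files(demo, 1)]
shutil.rmtree(demo)
print(found)
```

Yielding results instead of printing inside the loop leaves the display decision, and any encoding trouble that comes with it, to the caller.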

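(9) If the caller provides a non-existent folder, no exception is raised; the code just does nothing. The same is true if the provided folder exists but is empty. If you need to distinguish between these two scenarios, you will need to do extra testing yourself.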
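One possible form of that extra testing, sketched for Python 3. The helper name classify_top and its three return values are made up for illustration.

```python
import os
import tempfile


def classify_top(top):
    """Return 'missing', 'empty' or 'populated' for the folder top."""
    if not os.path.isdir(top):
        return "missing"
    if not os.listdir(top):
        return "empty"
    return "populated"


# Usage on a throwaway folder.
demo = tempfile.mkdtemp()
missing_path = os.path.join(demo, "no-such-folder")
missing = classify_top(missing_path)
empty = classify_top(demo)
# os.walk() swallows the listing error by default, so a missing
# folder simply produces no iterations at all.
walked_missing = list(os.walk(missing_path))
os.rmdir(demo)
print(missing, empty, walked_missing)
```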

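Response to this comment from the OP: """Thanks, please see the info repr() displayed in my first post. I don't know why it printed so many different items, but it looks like they all have problems. What they have in common is that they are all .lnk files. Could that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name in Explorer contains ( Modalità provvisoria)"""

(10) Umm, that's not ".INK".lower(), it's ".LNK".lower() ... perhaps you need to change the font in whatever you are reading this with.

(11) The fact that the "problem" file names all end in ".lnk" may have something to do with os.walk() and/or Windows doing something special with those file names.

(12) I repeat here the Python statement that you used to produce that output, with some whitespace introduced: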

print repr(
    "Error reading file %s" \
    % u"%s/%s" % (
        folder.decode('utf-8','ignore'),
        f.decode('utf-8','ignore')
        )
    )

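It seems that you have not read, not understood, or simply ignored the advice I gave you in a comment on another answer: UTF-8 is NOT relevant in the context of file names in a Windows file system.

What we care about is exactly what folder and f refer to. You have trampled all over the evidence by attempting to decode it as UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "( Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.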
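The difference between the two error handlers is easy to see on a single name. This sketch is for Python 3 and assumes the name was stored as cp1252 bytes, which the thread suggests but does not prove. Note that newer UTF-8 decoders drop or replace only the offending byte, while the Python 2.6 output quoted above appears to have swallowed the two following bytes as well.

```python
raw = b"Modalit\xe0 provvisoria"  # hypothetical cp1252 bytes for the name

ignored = raw.decode("utf-8", "ignore")    # bad bytes vanish without a trace
replaced = raw.decode("utf-8", "replace")  # bad bytes become U+FFFD
correct = raw.decode("cp1252")             # the codec that matches the data

for label, text in (("ignore", ignored), ("replace", replaced),
                    ("cp1252", correct)):
    print(label, ascii(text))
```

With "replace" the damage stays visible as \ufffd; with "ignore" the name just looks like a slightly different, plausible name.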

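In any case, it is suspicious that some of the file names had some kind of error yet appeared NOT to lose characters (or NOT to be mangled) with the "ignore" option.

Which part of "insert the statement print repr(folder), repr(f)" did you not understand? All you need to do is something like this: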

print "Some meaningful text" # "error reading file" isn't
print "folder:", repr(folder)
print "f:", repr(f)

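(13) It also appears that you have introduced UTF-8 elsewhere in your code, judging by the traceback: self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)

I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely str objects, so why introduce blahbah.encode()?

A more general point: try to understand what your problem is BEFORE changing your script. Thrashing about trying every suggestion, coupled with near-zero effective debugging technique, is not the way forward.

(14) When you run your script again, you might want to reduce the volume of output by running it over some subset of C:\ ... especially if you follow my original suggestion to print debug information for ALL file names, not just the erroneous ones (knowing what the non-error ones look like could help in understanding the problem).

Response to Bryan McLemore's "clean up" function:

(15) Here is an annotated interactive session that shows what actually happens with os.walk() and non-ASCII file names: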

C:\junk\terabytest>dir
[snip]
 Directory of C:\junk\terabytest

20/11/2009  01:28 PM    <DIR>          .
20/11/2009  01:28 PM    <DIR>          ..
20/11/2009  11:48 AM    <DIR>          empty
20/11/2009  01:26 PM                11 Hašek.txt
20/11/2009  01:31 PM             1,419 tbyte1.py
29/12/2007  09:33 AM                 9 Ð.txt
               3 File(s)          1,439 bytes
[snip]

C:\junk\terabytest>\python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pprint import pprint as pp
>>> import os

os.walk(unicode_string) -> results are unicode objects

>>> pp(list(os.walk(ur"c:\junk\terabytest")))
[(u'c:\\junk\\terabytest',
  [u'empty'],
  [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
 (u'c:\\junk\\terabytest\\empty', [], [])]

os.walk(str_string) -> results are str objects

>>> pp(list(os.walk(r"c:\junk\terabytest")))
[('c:\\junk\\terabytest',
  ['empty'],
  ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
 ('c:\\junk\\terabytest\\empty', [], [])]

cp1252 is the encoding I would expect to be used on my system ...

>>> u'\u0161'.encode('cp1252')
'\x9a'
>>> 'Ha\x9aek'.decode('cp1252')
u'Ha\u0161ek'

Decoding the str with UTF-8 doesn't work, as expected

>>> 'Ha\x9aek'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte

ANY random string of bytes can be decoded without error using latin1

>>> 'Ha\x9aek'.decode('latin1')
u'Ha\x9aek'

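BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; it has absolutely nothing to do with the correct answer

>>> import unicodedata
>>> unicodedata.name(u'\u0161')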
'LATIN SMALL LETTER S WITH CARON'
>>>
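The same point can be checked mechanically: a decode that succeeds is not necessarily a decode that is right. A small sketch for Python 3, reusing the byte string from the session above; the variable names are made up.

```python
import unicodedata

raw = b"Ha\x9aek"

as_cp1252 = raw.decode("cp1252")  # the code page the file system used
as_latin1 = raw.decode("latin1")  # never fails, whatever the bytes are

# The third character tells the two results apart.
letter_name = unicodedata.name(as_cp1252[2])
control_category = unicodedata.category(as_latin1[2])

print(ascii(as_cp1252), letter_name)
print(ascii(as_latin1), control_category)
```

Category "Cc" marks a control character, which is a strong hint that latin1 was the wrong guess for a file name.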

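(16) That example shows what happens when the character is representable in the default character set; what happens if it isn't? Here is an example (using IDLE this time) of a file name containing CJK ideographs, which definitely aren't representable in my default character set: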

IDLE 2.6.4      
>>> import os
>>> from pprint import pprint as pp

repr(Unicode results) looks fine

>>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
[(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]

and the unicode displays just fine in IDLE:

>>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
nihao你好.txt

The str result is evidently produced by using .encode(whatever, "replace"), which is not very useful; for example, you can't open the file by passing that as the file name.

>>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
[('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]

So the conclusion is that for best results, one should pass a unicode string to os.walk(), and deal with any display problems.
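One way to deal with the display side is to escape only the characters the output encoding cannot represent. This is a sketch for Python 3; the helper name printable is hypothetical, and it assumes that backslash escapes are an acceptable fallback on the console.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks escaped."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError or printing "?".
    return text.encode(encoding, "backslashreplace").decode(encoding)


print(printable("nihao\u4f60\u597d.txt"))
```

Unlike the "??" names in the str result above, the escaped form still identifies the file unambiguously, and the unicode name itself stays untouched for calls such as open().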

Answered 2025-04-15 by Python大师

