python, codecs, file.writelines(), UnicodeDecodeError decoding error

Asked 2025-04-17 06:43 by 数据工程师

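I can't figure out how to resolve a UnicodeDecodeError:

I can't write text to a file. I get a UnicodeDecodeError that mentions the character â = '0xe2'.

1) I'm sure the string contains no â = '0xe2' character at all.

2) re.search doesn't find â either in the string I'm trying to write with file.writelines(string).

3) I passed errors='replace' when opening the file, so file.writelines() shouldn't raise an error over a bad character.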

import codecs
import re
from django.utils.encoding import smart_str

FileA = codecs.open(fname, 'w', 'utf-8', errors='replace')

lines = smart_str(lines, 'utf-8', strings_only=False, errors='replace')
# lines is webpage text from BeautifulSoup.prettify(), converted to a string with Django's smart_str;
# as far as I can tell it does not contain the letter â = '0xe2'

FileA.writelines(lines) #gives UnicodeDecodeError : 'ascii' codec can't decode byte 0xe2 in position 9637: ordinal not in range(128).

myre = re.compile(r'0xe2', re.UNICODE) # letter   â = '0xe2'
print re.search(myre, lines) #gives None
linessub=myre.sub('', lines)
print re.search(myre, linessub)  #gives None

FileA.writelines(lines) #gives UnicodeDecodeError : 'ascii' codec can't decode byte 0xe2 in position 9637: ordinal not in range(128).

1 Answer

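You are using codecs.open, so your file object expects Unicode strings, not byte strings.

The advantage of this function is that you don't have to encode the string yourself before writing it to the file. You just write Unicode strings and the file object takes care of the encoding.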

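It looks like smart_str returns a UTF-8 encoded byte string (since you passed it the encoding name). If you hand that string to a file object that expects Unicode, it first tries to decode the byte string back to Unicode. Since it doesn't know how the incoming string was encoded, it falls back to ascii. The errors='replace' argument does not help here, because it applies only to the UTF-8 encoding step, not to this implicit ASCII decode. That is where the error comes from, because the string is UTF-8, not ASCII: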

UnicodeDecodeError : 'ascii' codec can't decode...
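
The implicit decode can be reproduced in isolation. The sketch below is only an illustration, not the asker's actual data: it assumes the page text contained a character such as U+2019 (right single quotation mark), whose UTF-8 encoding begins with the byte 0xE2. The sample string and the helper name try_ascii_decode are made up for this example. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; U+2019 stands in for whatever
# non-ASCII character the real page contained.
text = u"it\u2019s"

# A UTF-8 byte string, which is what the answer suggests smart_str returned.
encoded = text.encode("utf-8")


def try_ascii_decode(data):
    """Return the message of the UnicodeDecodeError raised by decoding
    `data` as ASCII, or None if the decode succeeds."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = try_ascii_decode(encoded)
print(message)
```

The byte 0xE2 in the message is therefore not the letter â itself. It is the first byte of a multi-byte UTF-8 sequence, which would explain why a search for â finds nothing.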

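So either skip the smart_str encoding step and write the Unicode string straight to the file, or replace codecs.open() with the plain open(), which works with bytes and therefore needs an already-encoded byte string.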
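
A minimal sketch of the two options, assuming the text is already available as a Unicode string. The sample text, the temporary directory, and the file names are hypothetical. The second file is opened with 'wb' so that the sketch behaves the same on Python 2 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample text with non-ASCII characters.
text = u"caf\u00e9 \u2019quoted\u2019\n"
tmpdir = tempfile.mkdtemp()

# Option 1: codecs.open encodes on write, so give it Unicode text.
path_text = os.path.join(tmpdir, "via_codecs.txt")
with codecs.open(path_text, "w", "utf-8") as f:
    f.write(text)

# Option 2: a binary-mode file takes bytes, so encode first.
path_bytes = os.path.join(tmpdir, "via_open.txt")
with open(path_bytes, "wb") as f:
    f.write(text.encode("utf-8"))

# Read both files back as raw bytes to compare them.
with open(path_text, "rb") as f:
    data_text = f.read()
with open(path_bytes, "rb") as f:
    data_bytes = f.read()

print(data_text == data_bytes)
```

Either way the text is encoded exactly once. The original error came from encoding with smart_str and then handing the result to a file object that wanted to encode it again.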

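By the way, the way you check for the 0xE2 character cannot work. First, you use r'0xe2' as the pattern, which is just a four-character string, not the single 0xE2 character. Second, for something this simple you don't need re at all. You can try this: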

print '\xe2' in your_str
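
To see where the offending bytes sit, not just whether one exists, a small helper along these lines could list every non-ASCII byte with its offset. The name non_ascii_offsets and the sample data are hypothetical; bytearray is used because it yields integers on both Python 2 and Python 3.

```python
def non_ascii_offsets(data):
    """Return (offset, byte value) pairs for every byte >= 0x80
    in a byte string."""
    return [(i, b) for i, b in enumerate(bytearray(data)) if b >= 0x80]


# Hypothetical sample: UTF-8 bytes for "it's" with a curly apostrophe.
sample = u"it\u2019s".encode("utf-8")
hits = non_ascii_offsets(sample)
print([(i, hex(b)) for i, b in hits])
```

The first offset reported should match the position given in the UnicodeDecodeError message for the same byte string.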

Answered 2025-04-17 by Python大师
