Python - Decoding a UTF-16 file with a BOM

22 votes

2 answers

30505 views

Asked 2025-04-17 22:41

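I have a UTF-16 LE file with a byte order mark (BOM). I want to convert it to UTF-8 without a BOM so that I can process it with Python.

The code I usually use didn't work: it returned unknown characters instead of the actual contents of the file.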

f = open('dbo.chrRaces.Table.sql').read()
f = str(f).decode('utf-16le', errors='ignore').encode('utf8')
print f

So what is the correct way to decode this file so that I can read it with f.readlines()?

2 Answers

This works in Python 3:

with open('test_utf16.txt', mode='r', encoding='utf-16') as fh:
    f = fh.read()  # the 'utf-16' codec picks the byte order from the BOM and drops it
print(f)
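To get the line-by-line access and the BOM-free UTF-8 copy that the question asks for, the same idea can be wrapped in two small helpers. This is only a sketch, assuming Python 3 and a file that really starts with a UTF-16 BOM; the names `read_utf16_lines` and `convert_utf16_to_utf8` are made up for illustration.

```python
def read_utf16_lines(path):
    """Return the lines of a BOM-prefixed UTF-16 file as str objects."""
    # The 'utf-16' codec uses the BOM to detect byte order and strips it.
    with open(path, mode='r', encoding='utf-16') as fh:
        return fh.readlines()


def convert_utf16_to_utf8(src_path, dst_path):
    """Rewrite a BOM-prefixed UTF-16 file as UTF-8 without a BOM."""
    # newline='' keeps the original line endings untouched in both directions.
    with open(src_path, mode='r', encoding='utf-16', newline='') as src:
        text = src.read()
    # Plain 'utf-8' (as opposed to 'utf-8-sig') does not write a BOM.
    with open(dst_path, mode='w', encoding='utf-8', newline='') as dst:
        dst.write(text)
```

Something like `read_utf16_lines('dbo.chrRaces.Table.sql')` would then stand in for the `f.readlines()` call in the question.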

Answered 2025-04-17 by Python大师

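First, you should read the file in binary mode; otherwise things get complicated.

Then check for the BOM and strip it, because it is part of the file but not part of the actual text.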

import codecs
with open('dbo.chrRaces.Table.sql', 'rb') as fh:   # binary mode, so the BOM bytes arrive unchanged
    encoded_text = fh.read()
bom = codecs.BOM_UTF16_LE                          # see dir(codecs) for the other BOM constants
assert encoded_text.startswith(bom)                # make sure the encoding is what you expect, otherwise you'll get wrong data
encoded_text = encoded_text[len(bom):]             # strip away the BOM
decoded_text = encoded_text.decode('utf-16le')     # decode to unicode

Don't encode (to utf-8 or anything else) until all parsing and processing is done. Do all of that work on unicode strings.
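As a rough illustration of "decode first, encode last", here is a sketch assuming Python 3, where unicode strings are plain `str`. The helper names `decode_utf16le_with_bom` and `non_empty_lines_as_utf8` are hypothetical, and dropping blank lines is just a stand-in for whatever processing is actually needed.

```python
import codecs


def decode_utf16le_with_bom(raw):
    """Decode UTF-16 LE bytes to str, requiring and removing the BOM."""
    bom = codecs.BOM_UTF16_LE
    if not raw.startswith(bom):
        raise ValueError('expected a UTF-16 LE byte order mark')
    return raw[len(bom):].decode('utf-16-le')


def non_empty_lines_as_utf8(raw):
    """Decode once at the start, work on str, encode once at the end."""
    text = decode_utf16le_with_bom(raw)                           # bytes -> str
    kept = [line for line in text.splitlines() if line.strip()]   # processing on str
    return '\n'.join(kept).encode('utf-8')                        # str -> bytes
```

Raising `ValueError` rather than using `assert` is a design choice made here so that the check still runs when Python is started with `-O`.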

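Also, using errors='ignore' when decoding is probably not a good idea. Consider which is worse: a program that tells you something went wrong and stops, or one that returns wrong data?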
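The difference is easy to see with deliberately damaged input. The following sketch (Python 3 assumed; `decode_both_ways` is a name invented for this example) cuts the last byte off a UTF-16 LE string to simulate a truncated file, then decodes it with the default strict handling and with errors='ignore'.

```python
def decode_both_ways(raw):
    """Return (strict_result_or_None, lenient_result) for UTF-16 LE bytes."""
    try:
        strict = raw.decode('utf-16-le')   # errors='strict' is the default
    except UnicodeDecodeError:
        strict = None                      # the problem is reported, nothing is returned
    lenient = raw.decode('utf-16-le', errors='ignore')
    return strict, lenient


# Simulate a truncated file: 'ab' is 4 bytes in UTF-16 LE, keep only 3.
damaged = 'ab'.encode('utf-16-le')[:-1]
strict, lenient = decode_both_ways(damaged)
```

Here the strict decode fails loudly, while the lenient one quietly returns text that is shorter than what was written, with no hint that anything was lost.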

Answered 2025-04-17 by Python大师
