Reading Unicode file data with BOM characters in Python

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)

3 Answers

User

#1 · Edited 2024-05-16 19:01:13

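The BOM should be stripped automatically when decoding UTF-16, but not when decoding UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this: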

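import codecs
import io
import os

import chardet

# Read only the first few bytes: enough for a BOM and a small chardet sample.
sample_size = min(32, os.path.getsize(filename))
with open(filename, 'rb') as f:
    raw = f.read(sample_size)

if raw.startswith(codecs.BOM_UTF8):
    # Plain 'utf-8' would leave U+FEFF in the data; 'utf-8-sig' strips it.
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

# filename and mode are the variables from the question's code.
with io.open(filename, mode, encoding=encoding) as infile:
    data = infile.read()

print(data)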
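
The UTF-16 half of the claim above can be checked with a small sketch using only the standard library. The sample string and variable names are made up for illustration.

```python
import codecs

# Build a UTF-16 little-endian payload with a leading BOM (sample text is arbitrary).
payload = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

# The generic 'utf-16' codec uses the BOM to pick the byte order, then drops it.
with_generic = payload.decode("utf-16")

# The endian-specific codec treats the BOM as ordinary data (U+FEFF) and keeps it.
with_explicit = payload.decode("utf-16-le")

print(repr(with_generic))   # 'hello'
print(repr(with_explicit))  # '\ufeffhello'
```

So the automatic stripping only applies when the generic utf-16 codec is used; naming the byte order explicitly leaves the BOM in the decoded text.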

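User

#2 · Edited 2024-05-16 19:01:13

I put together a nifty BOM-based detector based on Chewie's answer. It is sufficient for the common use case where the data is either in a known local encoding or in Unicode with a BOM (which is what text editors typically produce). More importantly, unlike chardet, it does not do any guessing, so it gives predictable results: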

import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # Check UTF-32 before UTF-16: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
    for enc, boms in (
            ('utf-8-sig', (codecs.BOM_UTF8,)),
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
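
As a rough illustration of how a detector like this might be wired into actual file reading, here is a self-contained sketch. The names `sniff_bom` and `read_text`, the sample file names, and the `latin-1` fallback are all hypothetical and chosen for the demo. It tests the UTF-32 BOMs before the UTF-16 ones, because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs
import os
import tempfile

# Longer BOMs first: BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
_BOMS = (
    ("utf-8-sig", (codecs.BOM_UTF8,)),
    ("utf-32", (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
    ("utf-16", (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
)

def sniff_bom(path, default):
    """Return a codec name based on the file's BOM, or `default` if none is found."""
    with open(path, "rb") as f:
        raw = f.read(4)  # returns fewer bytes if the file is shorter
    for enc, boms in _BOMS:
        if raw.startswith(boms):  # bytes.startswith accepts a tuple of prefixes
            return enc
    return default

def read_text(path, default="utf-8"):
    """Decode the whole file with the sniffed codec, falling back to `default`."""
    with open(path, encoding=sniff_bom(path, default)) as f:
        return f.read()

# Demo: write the same text in several encodings and read each file back.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, enc in [("a.txt", "utf-8-sig"), ("b.txt", "utf-16"),
                      ("c.txt", "utf-32"), ("d.txt", "latin-1")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding=enc) as f:
            f.write("héllo")
        results[enc] = (sniff_bom(path, "latin-1"), read_text(path, "latin-1"))

for enc, (detected, text) in results.items():
    print(enc, "->", detected, repr(text))
```

The generic utf-16 and utf-32 codecs consume the BOM themselves when reading, so the decoded text should come back without a leading U+FEFF in every case.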

User

#3 · Edited 2024-05-16 19:01:13

There is no reason to check whether a BOM exists. utf-8-sig handles that for you, and it behaves exactly like utf-8 if the BOM is absent:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

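In the examples above, you can see that utf-8-sig decodes the given bytes correctly whether or not a BOM is present. If you think there is even a small chance that a BOM exists in the files you are reading, just use utf-8-sig and don't worry about it.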
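
The same idea applied to files rather than byte strings might look like the sketch below. The helper name `read_utf8` and the temporary file names are assumptions for the demo, not part of the original answer.

```python
import codecs
import os
import tempfile

def read_utf8(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.txt")
    with_bom = os.path.join(tmp, "with_bom.txt")

    # One file without a BOM, one with the UTF-8 BOM prepended.
    with open(plain, "wb") as f:
        f.write("hello".encode("utf-8"))
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

    plain_text = read_utf8(plain)
    bom_text = read_utf8(with_bom)

print(plain_text == bom_text == "hello")  # True
```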
