Reading UTF-8 strings from a binary file

4 votes · 2 answers · 6840 views

Asked 2025-04-17 17:51

I have some files that contain many different kinds of binary data, and I am writing a module to process them.

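Among other things, they contain UTF-8 encoded strings in the following layout: a 2-byte big-endian stringLength (which I parse with struct.unpack()), followed by the string itself. Because the string is UTF-8 encoded, its length in bytes may be greater than stringLength, so a plain read(stringLength) comes up short whenever the string contains multi-byte characters (not to mention that it throws off every other field that follows in the file).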
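To make the mismatch concrete, here is a minimal sketch of the layout described above. `pack_record` is a hypothetical helper made up for illustration, and the sketch assumes that "characters" means Unicode code points.

```python
import struct

def pack_record(text):
    """Build one record: 2-byte big-endian character count, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    # ">H" is an unsigned 16-bit big-endian integer.
    # Assumption: the prefix counts characters (code points), not bytes.
    return struct.pack(">H", len(text)) + payload

record = pack_record(u"na\u00efve")
(string_length,) = struct.unpack(">H", record[:2])
payload_bytes = len(record) - 2

print(string_length)   # characters announced by the prefix
print(payload_bytes)   # bytes that actually follow the prefix
```

For this sample the prefix says 5 while 6 bytes follow, so read(5) would stop one byte short.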

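How can I read n UTF-8 characters (as opposed to n bytes) from a file, taking the multi-byte nature of UTF-8 into account? I have been searching for half an hour, and everything I found was either irrelevant or relied on assumptions I cannot make.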

Tags: string handling, file reading, data parsing, binary files, struct module, UTF-8 encoding, multi-byte characters, big-endian

2 Answers

In UTF-8, a single character occupies anywhere from 1 to 4 bytes.

If you need to read the file byte by byte, you have to follow the UTF-8 encoding rules. See http://en.wikipedia.org/wiki/UTF-8 for details.
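As a rough illustration of those rules, the first byte of each sequence tells you how many bytes the character occupies. The function name `utf8_sequence_length` is made up for this sketch, and it only inspects the lead byte; it does not validate the continuation bytes.

```python
def utf8_sequence_length(lead):
    """Return the total byte length of a UTF-8 sequence from its lead byte.

    `lead` is an integer 0-255. Raises ValueError for bytes that cannot start
    a sequence: continuation bytes 0x80-0xBF and the illegal values
    0xC0, 0xC1 and 0xF5-0xFF.
    """
    if lead < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2          # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4          # 11110xxx
    raise ValueError("invalid UTF-8 lead byte: 0x%02X" % lead)

# Compare the predicted length with the real encoded length.
for ch in (u"A", u"\u00e9", u"\u20ac", u"\U0001F600"):
    encoded = ch.encode("utf-8")
    print(len(encoded), utf8_sequence_length(bytearray(encoded)[0]))
```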

In most cases, you can simply set the encoding to utf-8 and read from the input stream.

You then do not need to worry about how many bytes were read.
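A small sketch of what that looks like for a file that contains nothing but UTF-8 text (the sample data and the temporary file are made up for the example). Note that the text layer decodes in buffered chunks, so this approach probably does not fit a file that mixes strings with other binary fields.

```python
import io
import os
import tempfile

# Write a small UTF-8 text file to read back (hypothetical sample data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(u"h\u00e9llo w\u00f6rld".encode("utf-8"))

# In text mode, read(n) returns n characters, not n bytes.
with io.open(path, encoding="utf-8") as f:
    first = f.read(5)

os.remove(path)
print(len(first))   # 5 characters, which took 6 bytes on disk
```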

Answered 2025-04-17 by Python大师


Given a file object and a number of characters, you can use:

# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)

def readUTF8(f, count):
    """Read `count` UTF-8 characters from file `f`, return as unicode"""
    # Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        if not lead:
            break  # end of file: stop instead of failing in ord()
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return b''.join(res).decode('utf8')

Test result:

>>> from io import BytesIO
>>> test = BytesIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'

In Python 3 this is, of course, much simpler: wrap the file object in an io.TextIOWrapper() object and leave the decoding to Python's built-in, efficient UTF-8 implementation.
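One caveat: io.TextIOWrapper reads ahead from the underlying file, which can swallow the binary fields that follow the string. As an alternative for mixed files, here is a Python 3 sketch that uses an incremental decoder from the codecs module instead of TextIOWrapper. `read_utf8_chars` is a name made up for this example, and reading one byte at a time favours simplicity over speed.

```python
import codecs
import io
import struct

def read_utf8_chars(f, count):
    """Read exactly `count` characters of UTF-8 from binary file `f`."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    chars = []
    while len(chars) < count:
        byte = f.read(1)
        if not byte:
            raise EOFError("file ended before %d characters were read" % count)
        # decode() returns '' until a complete multi-byte sequence has arrived.
        decoded = decoder.decode(byte)
        if decoded:
            chars.append(decoded)
    return "".join(chars)

# Hypothetical record: 2-byte big-endian count, the string, then other data.
data = struct.pack(">H", 5) + "na\u00efve".encode("utf-8") + b"\x01\x02"
f = io.BytesIO(data)
(n,) = struct.unpack(">H", f.read(2))
text = read_utf8_chars(f, n)
rest = f.read()   # the trailing binary data is still intact
print(text, rest)
```

Because only the bytes belonging to the string are consumed, the file position ends up exactly at the next field.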

Answered 2025-04-17 by Python大师

