Reading UTF-8 characters from a gzip file in Python

数据工程师

Asked 2025-04-15 16:53

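I am trying to read a gzip-compressed file (.gz) in Python, but I am running into a problem.

I use the gzip module to read the file, but its contents are UTF-8 encoded text, so eventually it reads an invalid character and crashes.

Does anyone know how to read a gzip file whose contents are UTF-8 encoded? I know there is a codecs module that can help, but I don't really understand how to use it.

Thanks!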

import string
import gzip
import codecs

f = gzip.open('file.gz','r')

engines = {}
line = f.readline()
while line:
    parsed = string.split(line, u'\u0001')

    #do some things...

    line = f.readline()
for en in engines:
  print(en)

Tags: utf-8, data reading, encoding issues, gzip, file decompression, codecs module

5 Answers

Maybe something like this:

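import codecs
import gzip

fname = 'file.gz'  # placeholder path, as in the question
zf = gzip.open(fname, 'rb')  # binary mode: yields raw bytes
reader = codecs.getreader("utf-8")  # StreamReader class for UTF-8
contents = reader(zf)  # wrap the byte stream so it yields decoded text
for line in contents:
    pass  # each line is a decoded Unicode string
zf.close()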

Answered 2025-04-15 by Python大师


This is possible since Python 3.3:

import gzip
gzip.open('file.gz', 'rt', encoding='utf-8')

Note that with gzip.open() you need to specify text mode ('t') explicitly, because the default mode is binary ('rb').
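As a minimal sketch of how the text-mode call could be applied to the loop in the question (Python 3 assumed): the helper name read_fields, the sample file, and its contents are made up for illustration, and the '\u0001' separator is taken from the question's code.

```python
import gzip
import os
import tempfile


def read_fields(path, sep="\u0001"):
    """Yield one list of fields per line of a UTF-8 encoded gzip file."""
    # 'rt' requests text mode, so each line arrives already decoded as str
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split(sep)


# Build a small sample file so the sketch runs on its own (hypothetical data)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("caf\u00e9\u0001na\u00efve\n")
    f.write("\u65e5\u672c\u8a9e\u0001text\n")

rows = list(read_fields(sample_path))
print(rows)
```

Iterating over the file object reads one line at a time, so the whole file does not have to fit in memory.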

Answered 2025-04-15 by Python大师


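I don't see why this should be so hard.

What exactly are you doing? Please explain what "eventually it reads an invalid character" means.

It should be as simple as: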

import gzip
fp = gzip.open('foo.gz')
contents = fp.read() # contents now has the uncompressed bytes of foo.gz
fp.close()
u_str = contents.decode('utf-8') # u_str is now a unicode string

Edit

This answer works in both Python 2 and Python 3. For Python 3, also see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode of gzip.open).
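If the crash turns out to be a UnicodeDecodeError caused by bytes that really are not valid UTF-8 (an assumption, since the question does not show the error), one option is the errors argument of bytes.decode. The following is only a sketch: the in-memory buffer and the stray 0xff byte are invented to reproduce the situation.

```python
import gzip
import io

# Hypothetical data: valid UTF-8 text with one stray 0xff byte in the middle
raw = "caf\u00e9 ".encode("utf-8") + b"\xff" + b" end"

# Compress into an in-memory buffer instead of a real file
buf = io.BytesIO()
with gzip.open(buf, "wb") as gz:
    gz.write(raw)
buf.seek(0)

# Read the uncompressed bytes back, as in the answer above
with gzip.open(buf, "rb") as gz:
    contents = gz.read()

# Strict decoding (the default) raises on the invalid byte
try:
    contents.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for each undecodable byte
lenient = contents.decode("utf-8", errors="replace")
print(strict_ok, repr(lenient))
```

Replacing bad bytes hides data corruption, so it is worth checking first whether the file is actually in a different encoding.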

Answered 2025-04-15 by Python大师
