Converting a file from GBK to UTF-8 and displaying it in the console

0 votes, 1 answer, 4849 views

Asked 2025-04-17 22:55

My system: Python 3.3 on Windows 7.
The file c:\\test_before is GBK-encoded. You can download it from the link below and save it as c:\\test_before to test.
http://pan.baidu.com/s/1i3DSuKd
With chcp 936 set, I can see every line of the output.

cname="c:\\test_before"
dat=open(cname,"r")
for line in dat.readlines():
    print(line)


Next, I used Python to re-encode the file as UTF-8.

cname="c:\\test_before"
dat=open(cname,"rb")
new=open("c:\\test_utf-8","wb")
for line in dat.readlines():
    line=line.decode("gbk").encode("utf-8")
    new.write(line)

new.close()

Then I set chcp 65001 and ran this:

new=open("c:\\test_utf-8","r")
for line in new.readlines():
    print(line)

Why do I get the wrong output? It fails with:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 370: illegal multibyte sequence


1 Answer

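Most likely, Python does not pick up the temporary code page change made with chcp, so open may not use the encoding you expect. When no encoding is given, open falls back to locale.getpreferredencoding(False), which on Windows reflects the system ANSI code page (GBK on your machine, as the error message shows) rather than the console code page. You can check which encoding a file object ended up with like this (the output below is from a system whose default is UTF-8):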

>>> fd = open('/tmp/somefile.txt', 'r')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>
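To see the defaults without opening a file, you can ask Python directly. This is only a small sketch: describe_encodings is a hypothetical helper name, and the values it reports depend entirely on your machine and console.

```python
import locale
import sys


def describe_encodings():
    """Report the encodings Python will use by default on this machine."""
    return {
        # Used by open() in text mode when no encoding argument is given.
        "open_default": locale.getpreferredencoding(False),
        # Used when print() writes to standard output; may differ from the above.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encodings().items()):
        print(name, value)
```

If open_default still reports a GBK code page after you run chcp 65001, that would be consistent with the error in the question.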

In Python 3 you can override the default by passing the encoding explicitly:

>>> fd = open('/tmp/somefile.txt', 'r', encoding='UTF-8')
>>> fd
<_io.TextIOWrapper name='/tmp/somefile.txt' mode='r' encoding='UTF-8'>

Being explicit about the encoding argument is probably what you want.
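As a rough illustration of why the mismatch matters, the sketch below decodes the same UTF-8 bytes with the right and the wrong codec. The helper decode_with and the sample text are made up for this example; they are not taken from the question's file.

```python
def decode_with(data, encoding):
    """Decode bytes with the given encoding, reporting failure instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "<decode failed: {}>".format(exc.reason)


if __name__ == "__main__":
    utf8_bytes = "中文".encode("utf-8")  # stands in for the converted file's contents
    # ascii() keeps the output printable whatever the console code page is.
    print(ascii(decode_with(utf8_bytes, "utf-8")))
    print(ascii(decode_with(utf8_bytes, "gbk")))
```

Depending on the bytes involved, the wrong codec either raises UnicodeDecodeError, as in the question, or silently produces garbled text.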

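Also, you don't need to open the output file in binary mode (I see you specified 'wb'). Just use 'w' and state the target encoding explicitly when you are converting between encodings.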

>>> fd2 = open('/tmp/write.txt', 'w', encoding='UTF-8')
>>> fd2.write(u'abcd話')
5

Note that write returns the number of characters written.

To do your conversion, you could write something like:

cname = "c:\\test_before"
dat = open(cname, "r", encoding="gbk")                # decode the source as GBK
new = open("c:\\test_utf-8", "w", encoding="utf-8")   # encode the output as UTF-8
for line in dat:
    new.write(line)

dat.close()
new.close()

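Finally, you should use the file objects as context managers. That keeps the code consistent, and in a simple case like this you don't have to close the files by hand. Your code would look like this: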

def gbk_to_utf8(source, target):
    # Both files are closed automatically when the with blocks exit.
    with open(source, "r", encoding="gbk") as src:
        with open(target, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line)
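To check the conversion end to end, one option is a round trip through temporary files. The sketch below repeats the function so it runs on its own. The round_trip helper, the file names, and the sample text are assumptions made for this example. The point that bears on the original question is the last open call: the converted file is read back with encoding="utf-8" rather than the locale default.

```python
import os
import tempfile


def gbk_to_utf8(source, target):
    """Re-encode a GBK text file as UTF-8 (same approach as the function above)."""
    with open(source, "r", encoding="gbk") as src, \
            open(target, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)


def round_trip(text):
    """Write text as GBK, convert it, and return the decoded UTF-8 contents."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "sample_gbk.txt")    # hypothetical file names
        target = os.path.join(tmp, "sample_utf8.txt")
        with open(source, "w", encoding="gbk") as f:
            f.write(text)
        gbk_to_utf8(source, target)
        # Explicit encoding here, so the locale default plays no part.
        with open(target, "r", encoding="utf-8") as f:
            return f.read()


if __name__ == "__main__":
    sample = "第一行\n第二行\n"
    print(round_trip(sample) == sample)
```

Whether the text then displays correctly in the console is a separate matter, since that depends on the console code page and font rather than on the file's encoding.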

Answered 2025-04-17 by Python大师
