Python: different results from setting the "encoding" parameter in different functions

Posted 2024-04-26 18:12:37


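I have some code

f = open('workfile', 'r', encoding='utf-8')
df = pandas.read_csv(...)

that opens a CSV file. The "encoding" parameter can be set both in open() and in pandas.read_csv(), and the result differs depending on where it is set. How do the two settings interact?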


1 Answer
User
Answer #1 · posted 2024-04-26 18:12:37

This is what happens in your program:

f = open('workfile', 'r', encoding='utf-8') # 1
df = pandas.read_csv(f, encoding=e) # 2

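(1) asks for the file's bytes to be decoded using the encoding 'utf-8'. If you print the representation of the file handle f, it shows something like this: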

<_io.TextIOWrapper name='workfile' mode='r' encoding='utf-8'>

When you read text from this wrapper, you get a Unicode string.
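As a minimal sketch of step (1), the snippet below writes a small UTF-8 file into a temporary directory and reads it back once in text mode and once in binary mode. The sample content and the temporary path are assumptions made for illustration; the original workfile is not shown in the question.

```python
import os
import tempfile

# Hypothetical sample content; the real 'workfile' is not part of the question.
sample = 'Hällo'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'workfile')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)

    # Text mode: the TextIOWrapper decodes the bytes and returns a str.
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Binary mode: nothing is decoded, the raw bytes come back unchanged.
    with open(path, 'rb') as f:
        raw = f.read()

print(type(text).__name__, repr(text))
print(type(raw).__name__, raw)
```

The point of the comparison is that a handle opened in text mode has already consumed the encoding decision, so anything that receives it downstream sees characters, not bytes.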

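(2) read_csv() is told to use some encoding e. It therefore converts the Unicode string back to bytes (an implicit encode(), which uses 'utf-8' on my system) and then decodes those bytes with encoding e.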

Here is a small test program to illustrate this:

import pandas

for file in ['workfile-utf-8.csv', 'workfile-cp1252.csv']:
    for file_encoding in ['utf-8', 'cp1252']:
        for pandas_encoding in [None, 'utf-8', 'cp1252']:
            with open(file, 'r', encoding=file_encoding) as fp:
                try:
                    print('***', file, fp, pandas_encoding)
                    df = pandas.read_csv(fp, encoding=pandas_encoding)
                    print(df)
                except Exception as ex:
                    print(ex)

The files mentioned are encoded as their names indicate.
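The two input files are not included in the answer. One possible way to create them is sketched below; the CSV content is an assumption reconstructed from the columns and values visible in the output, and write_test_files is a hypothetical helper name.

```python
import os

# Assumed content, reconstructed from the columns shown in the output.
CSV_TEXT = 'a,b,c\nHällo,€uro,Öl\n'


def write_test_files(directory='.'):
    """Write the same CSV text once as UTF-8 and once as cp1252."""
    paths = []
    for enc in ('utf-8', 'cp1252'):
        path = os.path.join(directory, f'workfile-{enc}.csv')
        # newline='' keeps the '\n' line endings unchanged on every platform.
        with open(path, 'w', encoding=enc, newline='') as fp:
            fp.write(CSV_TEXT)
        paths.append(path)
    return paths
```

Calling write_test_files() in the working directory should produce workfile-utf-8.csv and workfile-cp1252.csv, the names the test program expects.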

The output should look something like this (it may depend on your environment):

(1) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='utf-8'> None
       a          b        c
0      Hällo       €uro       Öl
(2) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='utf-8'> utf-8
       a          b        c
0      Hällo       €uro       Öl
(3) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='utf-8'> cp1252
        a            b         c
0      HÃ¤llo       â‚¬uro       Ã–l
(4) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='cp1252'> None
        a            b         c
0      HÃ¤llo       â‚¬uro       Ã–l
(5) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='cp1252'> utf-8
        a            b         c
0      HÃ¤llo       â‚¬uro       Ã–l
(6) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='cp1252'> cp1252
          a                b            c
0      HÃƒÂ¤llo       Ã¢â€šÂ¬uro       Ãƒâ€œl
(7) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='utf-8'> None
Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
(8) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='utf-8'> utf-8
Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
(9) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='utf-8'> cp1252
Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
(10) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='cp1252'> None
       a          b        c
0      Hällo       €uro       Öl
(11) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='cp1252'> utf-8
       a          b        c
0      Hällo       €uro       Öl
(12) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='cp1252'> cp1252
        a            b         c
0      HÃ¤llo       â‚¬uro       Ã–l

(1) file is utf-8 -> decode utf-8 -> used as is -> OK

(2) file is utf-8 -> decode utf-8 -> (encode w. default) -> (decode utf-8) -> OK here, but not in other environments

(3) file is utf-8 -> decode utf-8 -> (encode w. default) -> (decode cp1252) -> turns Hällo into HÃ¤llo etc.

...

(7) file is cp1252 -> decode utf-8 -> raises a UnicodeDecodeError, which leads to the error shown

...

(11) file is cp1252 -> decode cp1252 -> (encode w. default) -> (decode utf-8) -> OK here, but not in other environments

...
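The chains above can be modelled with plain str and bytes operations. The sketch below is only a model of the answer's description, not of pandas internals, which may differ between versions; implicit_round_trip and its parameter names are hypothetical. It shows why a case marked "OK here" depends on the default encoding of the environment.

```python
def implicit_round_trip(text, default_encoding, pandas_encoding):
    """Model of step (2): encode the str with a default codec, then decode it."""
    return text.encode(default_encoding).decode(pandas_encoding)


text = 'Hällo'

# Cases (2)/(11) with a utf-8 default: the round trip is lossless.
same = implicit_round_trip(text, 'utf-8', 'utf-8')

# Case (3): utf-8 bytes decoded as cp1252 give the typical mojibake.
mojibake = implicit_round_trip(text, 'utf-8', 'cp1252')

# Cases (2)/(11) again, assuming a cp1252 default instead: the bytes are
# not valid UTF-8, so decoding fails. Case (7) fails for the same reason,
# only earlier, while the file itself is being read.
try:
    implicit_round_trip(text, 'cp1252', 'utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True

print(same, mojibake, failed)
```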

Interestingly, in the particular case (6), Hällo, €uro, Öl turn into HÃƒÂ¤llo, Ã¢â€šÂ¬uro, Ãƒâ€œl.

This corresponds to the following sequence:

>>> x1 = 'Hällo,€uro, Öl'
>>> x1
'Hällo,€uro, Öl'
>>> x2 = x1.encode()
>>> x2
b'H\xc3\xa4llo,\xe2\x82\xacuro, \xc3\x96l'
>>> x3 = x2.decode('cp1252')
>>> x3
'HÃ¤llo,â‚¬uro, Ã–l'
>>> x4 = x3.encode()
>>> x4
b'H\xc3\x83\xc2\xa4llo,\xc3\xa2\xe2\x80\x9a\xc2\xacuro, \xc3\x83\xe2\x80\x93l'
>>> x5 = x4.decode('cp1252')
>>> x5
'HÃƒÂ¤llo,Ã¢â€šÂ¬uro, Ãƒâ€œl'
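Because every step in that sequence is a plain encode or decode, it can be run backwards. The following is a sketch under the assumption that each wrong pass was exactly "UTF-8 bytes decoded as cp1252"; undo_mojibake is a hypothetical helper, and it fails for text that contains bytes Python's cp1252 codec leaves undefined (such as 0x81 or 0x9d).

```python
def undo_mojibake(text, rounds=1):
    """Reverse `rounds` passes of 'UTF-8 bytes decoded as cp1252'."""
    for _ in range(rounds):
        text = text.encode('cp1252').decode('utf-8')
    return text


x1 = 'Hällo,€uro, Öl'
x3 = x1.encode('utf-8').decode('cp1252')  # one wrong pass, as in case (3)
x5 = x3.encode('utf-8').decode('cp1252')  # two wrong passes, as in case (6)

print(undo_mojibake(x3))
print(undo_mojibake(x5, rounds=2))
```

This is a repair for data that is already damaged. Deciding on the encoding in only one place, either in open() or in read_csv(), avoids the double conversion in the first place.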
