import pandas

# Try every combination of file, open() encoding and read_csv() encoding.
for file in ['workfile-utf-8.csv', 'workfile-cp1252.csv']:
    for file_encoding in ['utf-8', 'cp1252']:
        for pandas_encoding in [None, 'utf-8', 'cp1252']:
            with open(file, 'r', encoding=file_encoding) as fp:
                try:
                    print('***', file, fp, pandas_encoding)
                    df = pandas.read_csv(fp, encoding=pandas_encoding)
                    print(df)
                except Exception as ex:
                    print(ex)
The files mentioned are encoded as reflected in their names.
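To reproduce the test, the two input files could be generated along these lines. This is only a sketch: the sample content (header a,b,c and a single row) is inferred from the printed DataFrame, and write_sample_files is a hypothetical helper name, not something from the original program.

```python
from pathlib import Path

# Assumed sample content, inferred from the DataFrame shown in the output.
SAMPLE_TEXT = "a,b,c\nHällo,€uro,Öl\n"


def write_sample_files(directory="."):
    """Write the same text once as UTF-8 and once as cp1252 (hypothetical helper)."""
    paths = []
    for encoding in ("utf-8", "cp1252"):
        path = Path(directory) / f"workfile-{encoding}.csv"
        # write_bytes avoids any platform-specific newline translation
        path.write_bytes(SAMPLE_TEXT.encode(encoding))
        paths.append(path)
    return paths


if __name__ == "__main__":
    write_sample_files()
```

The two files hold the same text but different bytes, which is what makes the encoding combinations distinguishable.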
The output should look like this (it may depend on your environment):
(1) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='utf-8'> None
a b c
0 Hällo €uro Öl
(2) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='utf-8'> utf-8
a b c
0 Hällo €uro Öl
(3) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='utf-8'> cp1252
a b c
0 Hällo €uro Öl
(4) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='cp1252'> None
a b c
0 Hällo €uro Öl
(5) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='cp1252'> utf-8
a b c
0 Hällo €uro Öl
(6) workfile-utf-8.csv <_io.TextIOWrapper name='workfile-utf-8.csv' mode='r'
encoding='cp1252'> cp1252
a b c
0 Hällo €uro Öl
(7) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='utf-8'> None
Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
(8) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='utf-8'> utf-8
Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
(9) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='utf-8'> cp1252
Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
(10) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='cp1252'> None
a b c
0 Hällo €uro Öl
(11) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='cp1252'> utf-8
a b c
0 Hällo €uro Öl
(12) workfile-cp1252.csv <_io.TextIOWrapper name='workfile-cp1252.csv'
mode='r' encoding='cp1252'> cp1252
a b c
0 Hällo €uro Öl
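This is what happens in your program:
(1) The file is asked to decode its bytes using the encoding 'utf-8'. If you print the representation of the file handle f, it shows an _io.TextIOWrapper object with mode='r' and encoding='utf-8'. When you pull text from this wrapper, you get a unicode string.
(2) read_csv() is told to use some encoding e. So it converts the unicode string back to bytes (performing an implicit encode(), which on my system uses 'utf-8') and then decodes those bytes with e.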
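The two steps can be modelled with plain str and bytes operations. The function below is a simplified sketch of the chain as described, not pandas' actual implementation; simulate_read and its default_encoding parameter are names made up for illustration, and the utf-8 default is the assumption stated above for the author's system.

```python
def simulate_read(raw_bytes, file_encoding, pandas_encoding, default_encoding="utf-8"):
    """Model of the decode / implicit encode / decode chain (not pandas code)."""
    # Step (1): the text wrapper decodes the file's bytes.
    text = raw_bytes.decode(file_encoding)
    if pandas_encoding is None:
        # No second encoding requested: the text is used as is.
        return text
    # Step (2): implicit encode back to bytes, then decode with the requested encoding.
    return text.encode(default_encoding).decode(pandas_encoding)


raw = "Hällo".encode("utf-8")
print(simulate_read(raw, "utf-8", None))      # text used as is
print(simulate_read(raw, "utf-8", "utf-8"))   # round trip through the default encoding
print(simulate_read(raw, "utf-8", "cp1252"))  # mismatched second decode garbles the text
```

When default_encoding and pandas_encoding agree, the round trip is harmless; when they differ, the second decode reinterprets the bytes and the text is damaged.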
The small test program at the top illustrates this. The numbered cases in its output break down as follows:
(1) file is utf-8 -> decode utf-8 -> used as is -> OK
(2) file is utf-8 -> decode utf-8 -> (encode w. default) -> (decode utf-8) -> OK here, but not in other environments
(3) file is utf-8 -> decode utf-8 -> (encode w. default) -> (decode cp1252) -> turns Hällo into HÃ¤llo etc.
...
(7) file is cp1252 -> decode utf-8 -> raises UnicodeDecodeError, which leads to the error
...
(11) file is cp1252 -> decode cp1252 -> (encode w. default) -> (decode utf-8) -> OK here, but not in other environments
...
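The failure in case (7) can be reproduced without pandas. As a minimal sketch (using the sample row from the output as assumed input), decoding cp1252 bytes as utf-8 fails at the first non-ASCII character:

```python
# Bytes as they would be stored in the cp1252 file (sample row assumed from the output).
raw = "Hällo €uro Öl".encode("cp1252")

try:
    raw.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as ex:
    # The cp1252 byte for 'ä' is not a valid UTF-8 sequence in this position.
    outcome = "UnicodeDecodeError"
    print(ex)

print(outcome)
```

In the test program this exception is raised while the parser reads from the text wrapper, which is presumably why it surfaces as the less specific tokenizing error.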
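Funny (in an interesting way) is case (6) in particular, which garbles Hällo €uro Öl twice over (Hällo, for example, comes out as HÃƒÂ¤llo).
It corresponds to this sequence: encode as utf-8, decode as cp1252, encode as utf-8 again, decode as cp1252 again.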
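Spelled out with str and bytes methods, that sequence might look like the sketch below. It assumes, as stated earlier, that the implicit encode uses utf-8. The reversal at the end is an addition for checking the chain, not part of the original test program.

```python
original = "Hällo €uro Öl"

garbled = (
    original
    .encode("utf-8")    # bytes as stored in workfile-utf-8.csv
    .decode("cp1252")   # file opened with the wrong encoding
    .encode("utf-8")    # implicit re-encode (assumed default: utf-8)
    .decode("cp1252")   # decoded again with encoding='cp1252'
)
print(garbled)

# Undoing the steps in reverse order recovers the text, since no step lost information here.
restored = garbled.encode("cp1252").decode("utf-8").encode("cp1252").decode("utf-8")
print(restored == original)
```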