Why do I get a Unicode decode error message when I load a dataset?

1 answer

User

#1 · Posted 2024-04-17 20:30:19

First, you need to understand what a character encoding is and which one your file uses. It is not UTF-8.

There are many different character encodings, and Excel sometimes changes a file's encoding to "iso-8859-1" or "cp1252", which is maddening.
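To see where the error comes from, here is a minimal sketch using only the standard library. The sample string and the helper name `try_decode` are made up for illustration; the point is that bytes written as cp1252 are not necessarily valid UTF-8.

```python
# Bytes as a program using cp1252 (a common Windows encoding) would write them.
raw = "café".encode("cp1252")  # the "é" becomes the single byte 0xe9


def try_decode(data, encoding):
    """Return the decoded text, or a short error string if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


# Decoding with the wrong encoding fails, the right one succeeds.
print(try_decode(raw, "utf-8"))
print(try_decode(raw, "cp1252"))
```

The same thing happens inside pd.read_csv, which assumes UTF-8 unless you pass an encoding.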

Here is essential reading for anyone working in IT: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

To solve your problem, there are at least three options:

1) Try a few likely encodings (latin1, cp1252, etc.):

import pandas as pd

df = pd.read_csv('file.csv', encoding='latin1')
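If you would rather not try each encoding by hand, the trial can be sketched as a small loop. The helper name `first_working_encoding` and the candidate list are assumptions for illustration, not part of pandas. Note that latin1 maps every possible byte to a character, so it never fails and should come last.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the whole file, or None."""
    with open(path, "rb") as f:
        data = f.read()
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes; try the next
    return None
```

The result can then be passed on, for example pd.read_csv('file.csv', encoding=first_working_encoding('file.csv')). A successful decode only means the bytes are valid in that encoding, not that the characters are the intended ones, so check a few rows afterwards.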

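2) Save the file as UTF-8 (or another encoding you know) before reading it. Windows may have changed the encoding after you opened the file in Excel and updated some rows.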
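Re-saving can also be done from Python rather than from an editor. This is a rough sketch that assumes you already know the file's current encoding; the function name `convert_to_utf8` and the cp1252 default are illustrative assumptions.

```python
def convert_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Rewrite a text file as UTF-8, leaving line endings untouched."""
    # newline="" turns off newline translation in both directions.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

After converting, the new file can be read with pd.read_csv and no encoding argument, since UTF-8 is the default.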

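3) One way to solve this is to test a series of different character encodings and see whether any of them work. A better way, though, is to use the chardet module to try to guess the correct encoding automatically. It is not 100% guaranteed to be right, but it is usually faster than guessing by hand: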

import chardet
import pandas as pd

# look at the first ten thousand bytes to guess the character encoding
with open('file.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)
# example output:
# {'encoding': 'Windows-1252', 'confidence': 0.99, 'language': ''}

# read in the file with the encoding detected by chardet
df = pd.read_csv('file.csv', encoding=result['encoding'])
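Because the guess can be wrong or missing (chardet reports an encoding of None when it cannot decide), you may want to check the confidence before using it. This sketch works on the result dictionary shown above; the helper name `pick_encoding`, the 0.8 threshold, and the latin1 fallback are all assumptions chosen for illustration.

```python
def pick_encoding(result, min_confidence=0.8, fallback="latin1"):
    """Use the detected encoding only if the detector was confident enough."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback  # assumption: latin1 never fails, but may mis-map characters
    return encoding
```

With this, the final read becomes pd.read_csv('file.csv', encoding=pick_encoding(result)).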
