UnicodeDecodeError when reading the dictionary words file with a simple Python script

Asked 2025-04-15 12:21

I haven't used Python in a long time. I wanted to do a simple file scan, but ran into trouble running the script below with Python 3.0.1.

with open("/usr/share/dict/words", 'r') as f:
    for line in f:
        pass

I got this exception:

Traceback (most recent call last):
  File "/home/matt/install/test.py", line 2, in <module>
    for line in f:
  File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
    line = self.readline()
  File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
    while self._read_chunk():
  File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data

The line that fails is "Argentinian", which doesn't look special in any way.

Update: I added this to the open() call,

encoding="iso-8859-1"

and the problem went away.
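For anyone landing here later, this is roughly what the fix looks like end to end. It is a minimal sketch, not the asker's actual script: count_lines and the temporary sample file are made up for illustration so the example runs without /usr/share/dict/words.

```python
import os
import tempfile


def count_lines(path, encoding="iso-8859-1"):
    """Count lines in a text file, decoding it with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return sum(1 for _ in f)


# Hypothetical sample data: two words stored as ISO-8859-1 (Latin-1) bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Argentinian\n\u00c5ngstr\u00f6m\n".encode("iso-8859-1"))
    sample_path = tmp.name

# Reading with the matching encoding works.
latin1_count = count_lines(sample_path)

# Reading the same bytes as UTF-8 fails, as in the question.
try:
    count_lines(sample_path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

os.remove(sample_path)
print(latin1_count, utf8_failed)
```

One caveat: ISO-8859-1 maps every possible byte to a character, so decoding with it never raises. A clean read therefore shows that the error is gone, not that the file really is Latin-1.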


2 Answers

Can you check whether the file is valid UTF-8? One way to do it, taken from a Stack Overflow question:

iconv -f UTF-8 /usr/share/dict/words -o /dev/null

There are other ways to run the same check.
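If iconv isn't available, roughly the same check can be done in Python itself. This is a sketch under my own assumptions: is_valid_utf8_stream and chunk_size are illustrative names, and it uses the standard library's incremental decoder so the file does not have to fit in memory.

```python
import codecs
import io


def is_valid_utf8_stream(stream, chunk_size=65536):
    """Return True if the binary stream contains only valid UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        try:
            # final=True on the empty last read flags a truncated sequence.
            decoder.decode(chunk, final=not chunk)
        except UnicodeDecodeError:
            return False
        if not chunk:
            return True


# Small in-memory demo; a tiny chunk size forces a multi-byte
# character to be split across two reads.
good = io.BytesIO("caf\u00e9\n".encode("utf-8"))
bad = io.BytesIO(b"caf\xe9\n")  # Latin-1 byte, not valid UTF-8
print(is_valid_utf8_stream(good, chunk_size=2))
print(is_valid_utf8_stream(bad, chunk_size=2))
```

For a real file, pass something like open("/usr/share/dict/words", "rb") as the stream.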

Answered 2025-04-15 by Python大师


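How did you work out from "position 1689-1692" which line of the file failed? Those numbers are offsets within the chunk being decoded, not within the whole file. You would first have to work out which chunk it was. How did you do that?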

Try this at the interactive prompt:

buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but the error will give you the byte offset into the file
# set offset to the start position reported in that error, then:
buf[max(offset-50, 0):offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
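
The same steps can be wrapped in a small helper instead of copying the offset by hand. This is only a sketch: locate_decode_error and its context parameter are names I made up. It relies on the start attribute of UnicodeDecodeError, and because the whole buffer is decoded in one call, that offset is relative to the start of the data rather than to an internal chunk.

```python
def locate_decode_error(data, encoding="utf-8", context=20):
    """Find the first undecodable byte in a bytes object.

    Returns (offset, line_number, surrounding_bytes), or None if the
    data decodes cleanly. line_number is 1-based.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        offset = exc.start
        # Count the newlines before the bad byte to get its line.
        line_number = data.count(b"\n", 0, offset) + 1
        start = max(offset - context, 0)
        return offset, line_number, data[start:offset + context]
    return None


# Hypothetical sample: the third line holds a Latin-1 encoded character.
sample = b"alpha\nbeta\ncaf\xe9\ndelta\n"
print(locate_decode_error(sample))
```

For a real file, read it in binary mode first, e.g. open('the_file', 'rb').read(), and pass the result in.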

Answered 2025-04-15 by Python大师

