How do I ignore a UnicodeDecodeError in Python and skip to the next line when reading a file?

Posted 2024-05-23 22:38:32


I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

It is a third-party file, and I get a new one every day, so I would rather not change it. The file contains non-ASCII characters, such as Ö. I need to read the lines with Python, and I can afford to ignore any line that contains a non-ASCII character.

My problem is that when I read the file in Python, I get a UnicodeDecodeError as soon as a line with a non-ASCII character is reached, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is hit, the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and keep going. If possible, I would rather not make any changes to the input file.

Is there a way to do this? Thank you very much.


Tags: file, csv, text, test, encoding, for, error, line
1 answer

#1 · posted 2024-05-23 22:38:32

Your file apparently does not use the UTF-8 codec. It is important to use the correct codec when opening a file.

You can tell open() how to handle decoding errors with the errors keyword argument:

errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
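To make the surrogateescape round-trip described above concrete, here is a minimal sketch (the byte value 0xD6 is arbitrary; it happens to be "Ö" in Latin-1 and is invalid as UTF-8):

```python
# A bytes value ending in a byte that is not valid UTF-8
data = b"caf\xd6"

# surrogateescape maps the bad byte 0xD6 to code point U+DCD6
text = data.decode("utf-8", errors="surrogateescape")
print(ascii(text))        # 'caf\udcd6'

# Encoding with the same handler restores the original bytes
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == data)   # True
```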

Opening the file with any errors value other than 'strict' ('ignore', 'replace', etc.) will let you read the whole file without an exception being raised.
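For instance, a short sketch (it first writes a small sample file containing one stray Latin-1 byte, purely so the example is self-contained):

```python
# Create a sample file; 0xD6 is "Ö" in Latin-1 but invalid UTF-8
with open("demo.csv", "wb") as f:
    f.write(b"good line\n")
    f.write(b"bad line \xd6\n")

# errors="replace" substitutes U+FFFD for malformed data, so the
# loop runs to the end of the file instead of raising an exception
with open("demo.csv", encoding="utf-8", errors="replace") as f:
    for line in f:
        print(line, end="")
```

Both lines are printed; the bad byte shows up as the U+FFFD replacement character.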

Note that decoding takes place per buffered block of data, not per line of text. If you have to detect errors line by line, use the surrogateescape handler and test each line read for code points in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

For example:

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")
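If the goal is simply to drop the offending lines and keep going, as the question asks, the same surrogate test can drive a filter. A sketch, again writing its own small sample file so it is self-contained:

```python
import re

# Sample input with one undecodable line (0xD6 is invalid UTF-8)
with open("sample.csv", "wb") as f:
    f.write(b"ok,1\n")
    f.write(b"bad \xd6,2\n")
    f.write(b"ok,3\n")

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

# Keep only the lines that decoded without any surrogate escapes
clean_lines = []
with open("sample.csv", encoding="utf8", errors="surrogateescape") as f:
    for line in f:
        if _surrogates.search(line):
            continue  # undecodable bytes on this line; skip it
        clean_lines.append(line)

print(clean_lines)  # ['ok,1\n', 'ok,3\n']
```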

Take into account that not every decoding error can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 cannot cope with dropped or extra bytes, which affects how accurately line separators can be located. The approach above may then treat the rest of the file as one long line; if the file is big enough, that "line" can in turn be large enough to trigger a MemoryError.
