How do I ignore a UnicodeDecodeError in Python and skip to the next line when reading a file?

Posted 2024-05-23 22:38:32


I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

It is a third-party file, and I get a new one every day, so I would rather not change it. The file contains non-ASCII characters, such as Ö. I need to read the lines with Python, and I can afford to ignore any line that contains a non-ASCII character.

My problem is that when I read the file in Python, I get a UnicodeDecodeError as soon as a line with a non-ASCII character is reached, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is hit, the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and keep going. If possible, I would rather not make any changes to the input file.

Is there a way to do this? Thank you very much.


Tags: file, csv, text, test, encoding, for, error, line
1 answer

#1 · posted 2024-05-23 22:38:32

Your file apparently does not use the UTF-8 codec. It is important to use the correct codec when opening a file.

You can tell open() how to handle decoding errors with the errors keyword argument:

errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
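To make the surrogateescape round-trip described above concrete, here is a minimal sketch (the byte value 0xD6 is arbitrary; it happens to be "Ö" in Latin-1 and is invalid as UTF-8):

```python
# A bytes value ending in a byte that is not valid UTF-8
data = b"caf\xd6"

# surrogateescape maps the bad byte 0xD6 to code point U+DCD6
text = data.decode("utf-8", errors="surrogateescape")
print(ascii(text))        # 'caf\udcd6'

# Encoding with the same handler restores the original bytes
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == data)   # True
```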

Opening the file with any errors value other than 'strict' ('ignore', 'replace', etc.) will let you read the whole file without an exception being raised.
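For instance, a short sketch (it first writes a small sample file containing one stray Latin-1 byte, purely so the example is self-contained):

```python
# Create a sample file; 0xD6 is "Ö" in Latin-1 but invalid UTF-8
with open("demo.csv", "wb") as f:
    f.write(b"good line\n")
    f.write(b"bad line \xd6\n")

# errors="replace" substitutes U+FFFD for malformed data, so the
# loop runs to the end of the file instead of raising an exception
with open("demo.csv", encoding="utf-8", errors="replace") as f:
    for line in f:
        print(line, end="")
```

Both lines are printed; the bad byte shows up as the U+FFFD replacement character.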

Note that decoding takes place per buffered block of data, not per line of text. If you have to detect errors line by line, use the surrogateescape handler and test each line read for code points in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

For example:

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")
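If the goal is simply to drop the offending lines and keep going, as the question asks, the same surrogate test can drive a filter. A sketch, again writing its own small sample file so it is self-contained:

```python
import re

# Sample input with one undecodable line (0xD6 is invalid UTF-8)
with open("sample.csv", "wb") as f:
    f.write(b"ok,1\n")
    f.write(b"bad \xd6,2\n")
    f.write(b"ok,3\n")

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

# Keep only the lines that decoded without any surrogate escapes
clean_lines = []
with open("sample.csv", encoding="utf8", errors="surrogateescape") as f:
    for line in f:
        if _surrogates.search(line):
            continue  # undecodable bytes on this line; skip it
        clean_lines.append(line)

print(clean_lines)  # ['ok,1\n', 'ok,3\n']
```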

Take into account that not every decoding error can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 cannot cope with dropped or extra bytes, which affects how accurately line separators can be located. The approach above may then treat the rest of the file as one long line; if the file is big enough, that "line" can in turn be large enough to trigger a MemoryError.
