Python - Finding unicode/ascii problems

2 votes
4 answers
8179 views
Asked 2025-04-15 22:15

I am using csv.reader to pull information out of a very long table. I do some processing on that data set and then use xlwt to produce a workable Excel file.

However, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)

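How can I find exactly where in the data set the error occurs? Also, is there some code I could use to check a data set and pinpoint the problem (some data sets run fine, while others fail)?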

4 Answers

0

You can use the code snippet from the following question to get a Unicode-aware CSV reader:

1

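Python 2's csv module does not support Unicode characters or null characters. You can, however, try replacing them as shown below (change 'utf-8' to whatever encoding your CSV data uses):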

import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?') # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
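
For readers on Python 3, the same replace-then-parse idea can be sketched as below. This is only an illustration under assumptions: `ascii_lines` and the in-memory `sample` are hypothetical names standing in for a real file opened with an explicit encoding, and replacing characters with '?' loses information, so it suits diagnosis more than production use.

```python
import csv
import io


def ascii_lines(text_stream):
    """Yield each line with non-ASCII characters and NULs replaced by '?'."""
    for line in text_stream:
        # Round-trip through ASCII so every non-ASCII character becomes '?'.
        cleaned = line.encode("ascii", "replace").decode("ascii")
        # csv.reader has historically rejected NUL characters, so replace them too.
        yield cleaned.replace("\0", "?")


# Hypothetical sample data; a real script would pass an open text file instead.
sample = io.StringIO("name,note\nAnna,it\u2019s fine\nBob,a\0b\n")
rows = list(csv.reader(ascii_lines(sample)))
print(rows)
```

csv.reader accepts any iterable of strings, which is why a generator can sit between the file and the reader.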

If you want to locate the characters that the csv module cannot handle, you can do this:

import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
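
Note that the scan above decodes the file as UTF-8 first, so it may itself fail on a stray byte such as the 0x92 from the question. A minimal Python 3 sketch that avoids decoding and inspects the raw bytes instead is shown here; `find_non_ascii` and the `sample` bytes are hypothetical, and for a real file the bytes would come from something like `open(path, "rb").read()`.

```python
def find_non_ascii(data):
    """Return (line_number, byte_offset, byte_value) for each non-ASCII or NUL byte."""
    hits = []
    # Line numbers are 1-based; offsets are 0-based byte positions within the line.
    for lineno, line in enumerate(data.splitlines(), start=1):
        for offset, byte in enumerate(line):
            if byte >= 0x80 or byte == 0:
                hits.append((lineno, offset, byte))
    return hits


# Hypothetical sample containing the 0x92 byte from the error message.
sample = b"id,comment\n1,plain ascii\n2,it\x92s broken\n"
hits = find_non_ascii(sample)
for lineno, offset, byte in hits:
    print("line %d, byte offset %d: 0x%02x" % (lineno, offset, byte))
```

Because it works on bytes, this reports positions without needing to know the file's encoding in advance.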

Alternatively, you can use this CSV opener I wrote, which handles Unicode characters:

import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0: StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just 
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line: 
                if c==Qualifier and c==cB41==cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c==Qualifier: 
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode: 
                    # Not in qual mode and delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line: 
                cB42 = cB41; cB41 = c
                if c in Delims: 
                    # Delim
                    L.append('')
                else: 
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

# The function is defined as OpenCSV (capital O), and the for statement needs a colon.
for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...
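
In Python 3 the standard csv module works on decoded text, so a hand-written parser like the one above may not be needed. The following is a rough sketch rather than a drop-in replacement: `read_csv` and its parameters are hypothetical, and the in-memory `raw` bytes stand in for a file opened in binary mode.

```python
import csv
import io
from itertools import islice


def read_csv(stream, delimiter=",", quotechar='"', skip_rows=0):
    """Parse an already-decoded text stream, skipping the first skip_rows rows."""
    reader = csv.reader(stream, delimiter=delimiter, quotechar=quotechar)
    return list(islice(reader, skip_rows, None))


# Hypothetical bytes standing in for open(path, "rb"); 0x92 is not valid UTF-8 here.
raw = io.BytesIO(b'title,quote\n"a,b","it\x92s ""ok"""\n')
# errors="replace" turns each undecodable byte into U+FFFD instead of raising.
text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
rows = read_csv(text, skip_rows=1)
print(rows)
```

With a real path, `open(path, encoding="utf-8", errors="replace", newline="")` gives the same kind of stream in one call.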
3

The answer is actually simple: when you read data from a file, decode it to unicode using the file's encoding, and handle any UnicodeDecodeError that may be raised.

try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError as e:
    # the exception message names the offending byte and its position
    print("The error is there: %s" % e)

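Doing this saves a lot of trouble: you no longer have to worry about multibyte character encodings, and external libraries (such as xlwt) will handle things properly when they need to write.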

In Python 3.0, specifying the encoding of strings is mandatory, so it is a good idea to get into the habit now.
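
To make the decode-and-catch advice concrete in Python 3, here is a small sketch that reports where decoding fails. `try_decode` is a hypothetical helper, and treating the data as cp1252 is only a guess based on the 0x92 byte in the question, not something the question confirms.

```python
def try_decode(raw, encoding="utf-8"):
    """Decode bytes, returning (text, None) on success or (None, error) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_decode(b"it\x92s", "ascii")
if err is not None:
    # err.start is the offset of the first undecodable byte within err.object.
    print("cannot decode byte 0x%02x at offset %d" % (err.object[err.start], err.start))

# The same bytes decode cleanly once a matching encoding is named (assumed cp1252).
fixed, _ = try_decode(b"it\x92s", "cp1252")
```

The `start`, `end`, `object`, and `reason` attributes of UnicodeDecodeError are what make it possible to pinpoint the offending byte instead of only printing a generic message.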
