SQLite, Python, Unicode, and non-UTF data

Question

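I started by trying to store strings in sqlite using Python, and got this message:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

OK, I switched to Unicode strings. Then I started getting this message:

sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós'

This happened when I tried to retrieve data from the database. After more research I started encoding the text as utf8, but then 'Sigur Rós' started looking like 'Sigur RÃ³s'.

Note: my console was set to display in 'latin_1', as @John Machin pointed out.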
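The 'Sigur RÃ³s' symptom can be reproduced in a few lines. This is a minimal sketch written for Python 3 (an assumption; the post itself uses Python 2), where `bytes` plays the role of Python 2's `str` and `str` the role of `unicode`. The variable names are illustrative only.

```python
# Python 3 sketch: reproduce the 'Sigur RÃ³s' mojibake.
original = "Sigur R\u00f3s"  # 'Sigur Rós', written with an escape to keep the source ASCII

# Encoding to utf_8 turns the o-acute (U+00F3) into two bytes: 0xC3 0xB3.
utf8_bytes = original.encode("utf_8")

# A latin_1 console reads every byte as one character, so the two bytes
# show up as the two characters U+00C3 and U+00B3 ('Ã' and '³').
as_seen_on_latin1_console = utf8_bytes.decode("latin_1")

# The stored data is not damaged; undoing the wrong decoding recovers the text.
repaired = as_seen_on_latin1_console.encode("latin_1").decode("utf_8")

print(utf8_bytes)            # b'Sigur R\xc3\xb3s'
print(repaired == original)  # True
```

In other words, the garbled display alone does not show that the database content is wrong; it may only mean the bytes are being viewed with the wrong encoding.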

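What gives? After reading this, which describes exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple of hours, but I still don't know whether there is a way to correctly convert 'ó' from latin-1 to utf-8 without mangling it. If there isn't, why would sqlite "highly recommend" that I switch my application to unicode strings?

I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours, so that someone in my shoes can have an easier guide. If the information I post is wrong or misleading in any way, please tell me and I'll update it, or one of you senior folks can update it.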

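Summary of answers

Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, convert it to unicode using that source encoding, and then convert it to your desired encoding. Unicode is a base, and encodings are mappings of subsets of that base. utf_8 has space for every character in unicode, but because those characters are not stored as the same bytes as in, say, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In Python, the process of getting to unicode and then into another encoding looks like this: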

str.decode('source_encoding').encode('desired_encoding')

or, if the str is already in unicode:

str.encode('desired_encoding')
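As a concrete instance of the decode-then-encode pattern, here is a minimal sketch for Python 3 (an assumption; in Python 3 the byte string is `bytes` and the unicode string is `str`). The helper name `transcode` is hypothetical and not part of any library.

```python
# Python 3 sketch of decode(source).encode(desired).
def transcode(raw, source_encoding, desired_encoding):
    """Decode bytes with their real encoding, then encode to the target one."""
    text = raw.decode(source_encoding)    # bytes -> unicode text
    return text.encode(desired_encoding)  # unicode text -> bytes

latin1_bytes = b"Sigur R\xf3s"            # 'Sigur Rós' encoded as latin_1
utf8_bytes = transcode(latin1_bytes, "latin_1", "utf_8")

print(utf8_bytes)                         # b'Sigur R\xc3\xb3s'
```

The important part is that `source_encoding` has to be the encoding the bytes really are in; passing the wrong one either raises an error or silently produces the wrong characters.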

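For sqlite I didn't actually want to encode it again; I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in Python.

1. The encoding of the string you want to work with, and the encoding you want to get it to.
2. The system encoding.
3. The console encoding.
4. The encoding of the source file.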
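To make the sqlite case concrete before going through the four items: the sketch below decodes the incoming bytes once and hands sqlite a unicode string, with no second encode step. It assumes Python 3 and an in-memory database; the table name `tracks` is hypothetical, and `tag_artist` is borrowed from the error message in the question.

```python
import sqlite3

raw_name = b"Sigur R\xf3s"            # bytes as they arrived, assumed to be latin_1
artist = raw_name.decode("latin_1")   # decode once and keep it as unicode text

conn = sqlite3.connect(":memory:")    # throwaway in-memory database
conn.execute("CREATE TABLE tracks (tag_artist TEXT)")
conn.execute("INSERT INTO tracks (tag_artist) VALUES (?)", (artist,))

# With the default text_factory the value comes back as a unicode string.
stored = conn.execute("SELECT tag_artist FROM tracks").fetchone()[0]
conn.close()

print(ascii(stored))                  # 'Sigur R\xf3s'
```

The value that comes back is the single code point U+00F3, not the two-character 'Ã³' sequence, so any remaining garbling would come from how it is displayed.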

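Elaboration:

(1) When you read a string from some source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately, the characters in most filenames are not going to be made up of more than one source encoding, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1, utf_8, and whatever other characters mixed together in a filename on Windows. Sometimes those characters show up as boxes, other times they just look mangled, and other times they look correct (accented characters and the like). Moving on.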
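One way to act on "either completely utf_8 or completely latin_1" is to try utf_8 first and fall back to latin_1. This is only a heuristic sketch for Python 3 (an assumption), and the function name `decode_filename` is hypothetical. The order matters: latin_1 accepts every possible byte sequence, so it has to come last, and a latin_1 string that happens to be valid utf_8 would be misread.

```python
def decode_filename(raw, candidates=("utf_8", "latin_1")):
    """Return (text, encoding_used) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess did not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the bytes")

text, used = decode_filename(b"Sigur R\xf3s")      # not valid utf_8
print(used)                                        # latin_1
text, used = decode_filename(b"Sigur R\xc3\xb3s")  # valid utf_8
print(used)                                        # utf_8
```

Plain ascii input decodes under the first candidate, which is harmless because ascii is a subset of both encodings.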

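(2) Python has a default system encoding that gets set when Python starts and can't be changed during runtime. See here for details. Dirty summary... well, here's the file I added: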

# sitecustomize.py
# this file can be anywhere in your Python path,
# but it usually goes in ${pythondir}/lib/site-packages/
import sys
sys.setdefaultencoding('utf_8')

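This system encoding is the one that gets used when you call the unicode("str") function without any other encoding parameters. To say that another way, Python tries to decode "str" to unicode based on the default system encoding.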
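Python 3 has no implicit `unicode("str")` conversion, so the following is only an analogue, not the same mechanism: calling `bytes.decode()` with no argument always uses utf-8, which fails on a latin_1 byte in the same way `unicode(char)` fails under a utf_8 default. A minimal sketch, assuming Python 3:

```python
import sys

char = b"\xf3"                    # o-acute as a single latin_1 byte

print(sys.getdefaultencoding())   # utf-8 on Python 3

try:
    char.decode()                 # no argument given: utf-8 is used
    outcome = "decoded"
except UnicodeDecodeError as error:
    outcome = "failed: {0}".format(error.reason)

# Naming the real source encoding avoids depending on any default.
explicit = char.decode("latin_1")

print(outcome)
print(ascii(explicit))            # '\xf3'
```

Passing the encoding explicitly, as in the last step, sidesteps the default entirely, which is why the default matters less once every decode call names its encoding.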

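(3) If you're using IDLE or the command-line Python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm that what I was doing was working.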
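As far as I know, the encoding Python uses for console output is exposed as `sys.stdout.encoding` and comes from the terminal or locale settings, so it need not match `sys.getdefaultencoding()`. The sketch below (Python 3 assumed; the helper name `printable` is hypothetical) checks it and replaces characters the target console could not show, rather than raising UnicodeEncodeError.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the given console encoding lacks replaced by '?'."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

print(sys.stdout.encoding)                   # whatever this console reports
print(printable("Sigur R\u00f3s", "ascii"))  # Sigur R?s
```

This only protects against characters the console encoding cannot represent. It cannot detect a console that claims one encoding but renders another, which is the situation described for the eclipse console above.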

(4) If you want to have some test strings in your source code, for example

test_str = "ó"

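then you'll have to tell Python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like this at the top of your source code file: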

# -*- coding: utf_8 -*-

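If you don't have this information, Python attempts to parse your code as ascii by default, and so:

SyntaxError: Non-ASCII character '\xf3' in file redacted on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details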
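The effect of the coding line can be seen without creating files by compiling source held in byte strings. This sketch assumes Python 3, where the default source encoding is utf-8 rather than ascii, so the undeclared case fails for a slightly different reason than in the Python 2 message above; the exact exception type raised there may vary between versions, which is why two are caught.

```python
# Two tiny "source files" as bytes; 0xF3 is the o-acute in latin_1.
declared = b"# -*- coding: latin_1 -*-\ntest_str = '\xf3'\n"
undeclared = b"test_str = '\xf3'\n"

# With the declaration, the byte is read as latin_1 and becomes U+00F3.
namespace = {}
exec(compile(declared, "<declared>", "exec"), namespace)
print(ascii(namespace["test_str"]))   # '\xf3'

# Without it, the lone 0xF3 byte is not valid in the default source encoding.
try:
    compile(undeclared, "<undeclared>", "exec")
    undeclared_ok = True
except (SyntaxError, UnicodeDecodeError):
    undeclared_ok = False
print(undeclared_ok)                  # False
```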

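Once your program is working correctly, or if you aren't using Python's console or any other console to look at output, then you will probably only really care about item 1 on the list. The system default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function that I will paste at the bottom of this gigantic mess, and I hope it correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: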

'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

Now I change the system and console encoding to latin_1, and I get this output for the same input:

'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

Notice that the 'original' character displays correctly and the builtin unicode() function works now.

Now I change my console output back to utf_8.

'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8')  ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

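Here everything still works the same as last time, but the console can't display the output correctly. Etc. The function below also displays more information than this and will hopefully help someone figure out where the gap in their understanding is. I know all this information is in other places and dealt with more thoroughly there, but I hope this is a good kickoff point for someone trying to get coding with Python and/or sqlite. Ideas are great, but sometimes source code can save you a day or two of trying to figure out what functions do what.

Disclaimers: I'm no encoding expert; I put this together to help my own understanding. I kept building on it when I should probably have started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes; they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input.

One more thing: there are apparently crazy application developers making life difficult in Windows.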

#!/usr/bin/env python
# -*- coding: utf_8 -*-

import os
import sys

def encodingDemo(str):
    validStrings = ()
    try:        
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print ude
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR.  unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}.  See error:\n\t".format(sys.getdefaultencoding()),  
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print uee
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:
            y = str.decode('latin_1').encode('utf_8')
            print "str.decode('latin_1').encode('utf_8') =",y
            # record the re-encoded byte string, not the intermediate unicode value
            validStrings+= ((y, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeEncodeError as uee:
            # encode() raises UnicodeEncodeError, not UnicodeDecodeError
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8.  See error:\n\t",
            print uee
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded.  See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",
        print uee
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:
            y = str.decode('utf_8').encode('latin_1')
            print "str.decode('utf_8').encode('latin_1') =",y
            # only record the string as valid when the encode step succeeded
            validStrings+= ((y, " decoded with utf_8 into unicode and encoded into latin_1"),)
        except UnicodeEncodeError as uee:
            # encode() raises UnicodeEncodeError, not UnicodeDecodeError
            print "str.decode('utf_8').encode('latin_1') didn't work.  The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1.  See error:\n\t",
            print uee
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded.  See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR.  Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string.  See error:\n\t",uee

    print
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            
        try:
            x = unicode(char)        
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char)  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
               
        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1')  ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
            
        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8')  ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8')  {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
            
        print

x = 'ó'
encodingDemo(x)
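The disclaimer mentions passing functions as arguments to cut the repetition. The following is one possible shape for that, not the author's code: a sketch for Python 3 (an assumption, since the demo above is Python 2) in which `try_step`, `STEPS`, and `encoding_demo` are hypothetical names. Each conversion is a small function, and one helper runs it and reports either the result or the error.

```python
def try_step(label, func, value):
    """Run one conversion and describe the outcome as a single line."""
    try:
        result = func(value)
        return "{0} -> {1}".format(label, ascii(result))
    except (UnicodeDecodeError, UnicodeEncodeError) as error:
        return "{0} -> ERROR: {1}".format(label, error.reason)

# Each entry pairs a label with the conversion to attempt on the raw bytes.
STEPS = [
    ("decode('latin_1')", lambda raw: raw.decode("latin_1")),
    ("decode('utf_8')", lambda raw: raw.decode("utf_8")),
    ("decode('latin_1').encode('utf_8')",
     lambda raw: raw.decode("latin_1").encode("utf_8")),
]

def encoding_demo(raw):
    """Return one report line per conversion in STEPS."""
    return [try_step(label, func, raw) for label, func in STEPS]

for line in encoding_demo(b"\xf3"):
    print(line)
```

Results are printed through `ascii()` so the report itself never depends on the console encoding; adding another encoding to test means adding one entry to `STEPS`.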

Many thanks for the answers below, and especially to @John Machin for answering so thoroughly.

