SQLite, Python, Unicode, and non-UTF data

'�' = original char <type 'str'> repr(char)='\xf3'
'?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

'ó' = original char <type 'str'> repr(char)='\xf3'
'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

'�' = original char <type 'str'> repr(char)='\xf3'
'�' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3'
'�' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3'
'?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data

#!/usr/bin/env python
# -*- coding: utf_8 -*-
import os
import sys

def encodingDemo(str):
    validStrings = ()
    try:
        print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str))
        validStrings += ((str,""),)
    except UnicodeEncodeError as ude:
        print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print ude
    try:
        x = unicode(str)
        print "unicode(str) = ",x
        validStrings+= ((x, " decoded into unicode by the default system encoding"),)
    except UnicodeDecodeError as ude:
        print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string."
        print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()),
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    try:
        x = str.decode('latin_1')
        print "str.decode('latin_1') =",x
        validStrings+= ((x, " decoded with latin_1 into unicode"),)
        try:
            print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8')
            validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),)
        except UnicodeDecodeError as ude:
            print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t",
            print ude
    except UnicodeDecodeError as ude:
        print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",
        print uee
    try:
        x = str.decode('utf_8')
        print "str.decode('utf_8') =",x
        validStrings+= ((x, " decoded with utf_8 into unicode"),)
        try:
            print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1')
        except UnicodeDecodeError as ude:
            print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t",
            validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),)
            print ude
    except UnicodeDecodeError as ude:
        print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t",
        print ude
    except UnicodeEncodeError as uee:
        print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee

    print
    print "Printing information about each character in the original string."
    for char in str:
        try:
            print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char))
        except UnicodeDecodeError as ude:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee)
            print uee
        try:
            x = unicode(char)
            print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = unicode(char) ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('latin_1')
            print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        try:
            x = char.decode('utf_8')
            print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x))
        except UnicodeDecodeError as ude:
            print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude)
        except UnicodeEncodeError as uee:
            print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee)
        print

x = 'ó'
encodingDemo(x)

3 Answers

User

Answer #1 · edited 2024-06-07 11:42:49

I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it

When debugging problems like this, repr() and unicodedata.name() are your friends:

>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>

If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
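The effect described above can be reproduced without a terminal. This is a minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2, where `str` corresponds to Python 3 `bytes` and `unicode` to Python 3 `str`). The variable names are illustrative only.

```python
import unicodedata

# U+00F3 (o with acute) encoded as UTF-8 is the two bytes 0xC3 0xB3.
oacute_utf8 = "\u00f3".encode("utf-8")

# A latin1 terminal treats every byte as one character, which is what
# decoding the same bytes as latin-1 simulates.
as_seen_on_latin1_terminal = oacute_utf8.decode("latin-1")

for ch in as_seen_on_latin1_terminal:
    # ascii() keeps the output printable on any console.
    print(ascii(ch), unicodedata.name(ch))
```

The two names printed are the A-tilde and superscript-3 mentioned above.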

I switched to Unicode strings.

What are you calling Unicode strings? UTF-16?

What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

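I can't imagine how you came to that conclusion. The story being conveyed is that unicode objects in Python and UTF-8 encoding in the database are the way to go. Martin, however, answered the original question by giving a method ("text factory") that lets the OP keep using latin1. That did not constitute a recommendation!

Update in response to further questions raised in a comment: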

I didn't understand that the unicode characters still contained an implicit encoding. Am I saying that right?

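No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn't have an encoding, implicit or otherwise.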
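To make the point concrete, here is a small sketch (Python 3 syntax, which is an assumption; the list of encodings is an arbitrary choice for illustration) showing one Unicode character mapped to different byte sequences by different encodings, all of which decode back to the same character.

```python
char = "\u00f3"  # LATIN SMALL LETTER O WITH ACUTE

# The same abstract character under several encodings.
encodings = ("latin-1", "cp1252", "utf-8", "utf-16-le")
encoded = {name: char.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(name, raw)

# Every byte sequence decodes back to the one character it came from.
round_trips = all(raw.decode(name) == char for name, raw in encoded.items())
```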

It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().

Say what? It doesn't look like it to me:

>>> unicode("\xF3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal
not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>

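Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') ... this is certainly true.

It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) ... including blowing up when an inappropriate encoding is supplied.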
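The Python 3 counterpart of that equivalence is `str(bytes_object, encoding)` versus `bytes_object.decode(encoding)`. The sketch below (Python 3 is an assumption; the thread itself is Python 2) checks that both agree on success and both raise on an unsuitable encoding.

```python
raw = b"\xf3"

# Both spellings produce the same text when the encoding fits.
same_result = str(raw, "latin-1") == raw.decode("latin-1")

# Both spellings raise UnicodeDecodeError when it does not.
errors = []
for decode in (lambda b: str(b, "utf-8"), lambda b: b.decode("utf-8")):
    try:
        decode(raw)
    except UnicodeDecodeError as exc:
        errors.append(exc.reason)

print(same_result, errors)
```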

Is that a happy circumstance

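That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, ANY 8-bit byte, and so ANY Python str object, can be decoded into unicode without an exception being raised. This is as it should be.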
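The code-for-code correspondence is easy to check. A short sketch, written for Python 3 (an assumption), where `bytes` stands in for the Python 2 `str` discussed here:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as latin-1 never raises, whatever the bytes are.
decoded = all_bytes.decode("latin-1")

# Each resulting code point equals the byte value it came from.
code_points_match = all(ord(ch) == byte for ch, byte in zip(decoded, all_bytes))

# The mapping is reversible.
round_trips = decoded.encode("latin-1") == all_bytes

print(code_points_match, round_trips)
```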

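However, there are people who confuse two quite separate concepts: "my script runs to completion without any exceptions being raised" and "my script is error-free". To them, latin1 is "a snare and a delusion".

In other words, if you have a file that is actually encoded in cp1252 or gbk or koi8-u or whatever, and you decode it using latin1, the resulting Unicode will be utter rubbish, and Python (or any other language) will not flag an error; it has no way of knowing that you have made a silly mistake.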
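A small illustration of that silent failure, as a Python 3 sketch (an assumption; the sample text is arbitrary): gbk-encoded bytes decode "successfully" as latin1, but the result has nothing to do with the original text.

```python
original = "\u4e2d\u6587"       # the two Chinese characters for "Chinese language"
raw = original.encode("gbk")    # four bytes, all above 0x7F

wrong = raw.decode("latin-1")   # no exception, but four unrelated characters
right = raw.decode("gbk")       # the correct encoding recovers the text

print(ascii(wrong), ascii(right))
```

No exception is raised on the wrong path, which is exactly why "it ran without errors" proves nothing about correctness.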

or is unicode("str") going to always return the correct decoding?

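Written just like that, the default encoding is ascii, so it will return the correct unicode if the data is actually encoded in ASCII. Otherwise, it'll blow up.

Similarly, if you specify the correct encoding, or one that is a superset of the correct encoding, you'll get the correct result. Otherwise you'll get gibberish or an exception.

In short: the answer is no.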

If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?

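If the str object is a valid XML document, the encoding will be specified up front; the default is UTF-8. If it's a properly constructed web page, it should be specified up front (look for "charset"). Unfortunately, many web page authors declare it flat-out wrongly (ISO-8859-1 aka latin1 should be Windows-1252 aka cp1252; don't waste resources trying to decode gb2312, use gbk instead). You can get clues from the nationality/language of the website.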
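As a rough sketch of "look up front, then distrust what you find", the hypothetical helper below scans the first bytes for an XML declaration or an HTML `charset` attribute and applies the two substitutions suggested above. The function name, the regular expressions, and the 1024-byte window are all assumptions made for illustration; this is not a full XML or HTML parser.

```python
import re

# Simplified patterns; real documents can be messier than these assume.
_XML_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

# Declared name -> encoding worth trying instead, per the advice above.
_OVERRIDES = {"iso-8859-1": "cp1252", "latin1": "cp1252", "gb2312": "gbk"}

def declared_encoding(raw):
    """Return the encoding declared near the start of raw, or None."""
    head = raw[:1024]
    match = _XML_DECL.search(head) or _META_CHARSET.search(head)
    if match is None:
        return None
    name = match.group(1).decode("ascii").lower()
    return _OVERRIDES.get(name, name)

print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```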

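UTF-8 is always worth trying. If the data is ascii, it will work fine, because ascii is a subset of utf8. A string of text that was written using non-ascii characters and encoded in something other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.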
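One way to act on this is to try UTF-8 first and fall back only when it fails. The helper below is a hypothetical sketch in Python 3 (the function name and the cp1252 fallback are assumptions, not something the answer prescribes).

```python
def decode_utf8_first(raw, fallback="cp1252"):
    """Decode raw as UTF-8 if possible, otherwise with the fallback encoding.

    Returns (text, encoding_used) so the caller can see which path was taken.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" keeps going past bytes the fallback cannot map.
        return raw.decode(fallback, errors="replace"), fallback

for sample in (b"plain ascii", b"Rom\xc3\xa1n", b"Rom\xe1n"):
    text, used = decode_utf8_first(sample)
    print(ascii(text), used)
```

The fallback result is only a guess; the second element of the returned tuple is there so that guess stays visible to the caller.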

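All of the above heuristics, and more, plus a lot of statistics, are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However, you can't make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence, e.g. 0.8. Always check the confidence part of the answer.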
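A hedged sketch of checking the confidence before trusting the guess. chardet is a third-party package, so the import is guarded; `guess_encoding` and the 0.9 threshold are hypothetical choices for illustration, not values from the answer.

```python
try:
    import chardet  # third-party package; may not be installed
except ImportError:
    chardet = None

def guess_encoding(raw, min_confidence=0.9):
    """Return chardet's guess for raw, or None if it is missing or unsure."""
    if chardet is None:
        return None
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    if result["encoding"] is None or result["confidence"] < min_confidence:
        return None
    return result["encoding"]

print(guess_encoding(b"Just some plain ASCII text for the detector."))
```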

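If all else fails:

(1) Try asking here, with a small sample from the front of your data ... print repr(your_data[:400]) ... and whatever collateral information you have about its provenance.

(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.

Update 2: BTW, isn't it about time you opened up another question ?-)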

One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren't the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.

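It's not Windows that's doing it; it's a bunch of crazy application developers. It would have been easier to follow if, instead of paraphrasing, you had quoted the opening paragraph of the effbot article that you referred to: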

Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.

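Background:

The range U+0000 to U+001F inclusive is designated in Unicode as "C0 Control Characters". These also exist in ASCII and latin1, with the same meanings. They include such familiar things as Carriage Return, Line Feed, Bell, Backspace, Tab, and others that are rarely used.

The range U+0080 to U+009F inclusive is designated in Unicode as "C1 Control Characters". These also exist in latin1, and comprise 32 characters that nobody outside unicode.org can imagine any possible use for.

Consequently, if you run a character frequency count on your unicode or latin1 data and find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, in which case the effbot's solution will work. In another case that I've been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 with files in another encoding, which had to be deduced from letter frequencies in the (human) language the files were written in.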
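The check and the cp1252 repair described above might look like the following. This is a sketch in Python 3 with hypothetical function names; it is not the effbot code, only the same idea of reinterpreting C1 code points as the cp1252 characters at those positions.

```python
def find_c1_controls(text):
    """Return the distinct C1 control characters (U+0080..U+009F) in text."""
    return sorted({ch for ch in text if "\x80" <= ch <= "\x9f"})

def reinterpret_c1_as_cp1252(text):
    """Replace C1 controls with the cp1252 character at the same byte value."""
    out = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # cp1252 leaves a few byte values undefined; keep as-is
        out.append(ch)
    return "".join(out)

# cp1252 curly quotes (0x93, 0x94) wrongly decoded as latin1.
suspect = b"\x93quoted\x94".decode("latin-1")
print([ascii(ch) for ch in find_c1_controls(suspect)])
print(ascii(reinterpret_c1_as_cp1252(suspect)))
```

This repair only applies when the corruption really is mislabelled cp1252; as noted above, other causes need other fixes.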

User
Answer #2 · edited 2024-06-07 11:42:49

I fixed this by setting:
conn.text_factory = lambda x: unicode(x, 'utf-8', 'ignore')
By default, text_factory is set to unicode(), which will use the current default encoding (ascii on my machine).
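For comparison, a minimal Python 3 sketch of the same idea (an assumption; the answer above is Python 2). In Python 3 the `text_factory` callable receives `bytes`. The table name `t` and the sample bytes are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT)")
# Store latin1 bytes as TEXT; SQLite does not validate them as UTF-8.
conn.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (b"caf\xe9 ol\xe9",))

# Lenient UTF-8: undecodable bytes are silently dropped.
conn.text_factory = lambda raw: raw.decode("utf-8", "ignore")
lossy = conn.execute("SELECT x FROM t").fetchone()[0]

# If the column is known to hold latin1, decoding it as such loses nothing.
conn.text_factory = lambda raw: raw.decode("latin-1")
intact = conn.execute("SELECT x FROM t").fetchone()[0]

print(ascii(lossy), ascii(intact))
```

Note that 'ignore' avoids the exception by discarding data, so the accented characters are gone from the first result.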

User
Answer #3 · edited 2024-06-07 11:42:49

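UTF-8 is the default encoding of SQLite databases. This shows up in situations like "SELECT CAST(x'52C3B373' AS TEXT);". However, the SQLite C library doesn't actually check whether a string inserted into a database is valid UTF-8.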
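Both statements can be observed from Python's sqlite3 module. A short sketch, assuming Python 3 and an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 0x52 'R', 0xC3 0xB3 (o with acute in UTF-8), 0x73 's'.
valid_text = conn.execute("SELECT CAST(x'52C3B373' AS TEXT)").fetchone()[0]
print(ascii(valid_text))

# A lone 0xF3 is not valid UTF-8, yet SQLite accepts it as TEXT unchanged.
stored_type, stored_hex = conn.execute(
    "SELECT typeof(CAST(x'F3' AS TEXT)), hex(CAST(x'F3' AS TEXT))"
).fetchone()
print(stored_type, stored_hex)
```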

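If you insert a Python unicode object (or a str object in 3.x), the Python sqlite3 library will automatically convert it to UTF-8. But if you insert a str object, it will just assume the string is UTF-8, because a Python 2.x "str" doesn't know its encoding. This is one reason to prefer Unicode strings.

However, that doesn't help you if your data is broken to begin with.

To fix your data, run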

db.create_function('FIXENCODING', 1, lambda s: str(s).decode('latin-1'))
db.execute("UPDATE TheTable SET TextColumn=FIXENCODING(CAST(TextColumn AS BLOB))")

for every text column in your database.
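A Python 3 rendering of the same repair, as a sketch under stated assumptions: the data really is latin1, and the table and column names are the placeholders from the snippet above. In Python 3 the BLOB argument arrives as `bytes`, so the `str(s)` wrapper is not needed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE TheTable (TextColumn TEXT)")
# Simulate broken data: latin1 bytes stored in a TEXT column.
db.execute("INSERT INTO TheTable VALUES (CAST(? AS TEXT))", (b"Rom\xe1n",))

def fix_encoding(raw):
    """Decode latin1 bytes to text; sqlite3 then stores the result as UTF-8."""
    return None if raw is None else raw.decode("latin-1")

db.create_function("FIXENCODING", 1, fix_encoding)
db.execute("UPDATE TheTable SET TextColumn = FIXENCODING(CAST(TextColumn AS BLOB))")

fixed_text, fixed_hex = db.execute(
    "SELECT TextColumn, hex(TextColumn) FROM TheTable"
).fetchone()
print(ascii(fixed_text), fixed_hex)
```

Running the update on rows that are already valid UTF-8 would mangle them, so it should be applied once, and only to data known to be latin1.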
