Decoding ISO-8859-1 and encoding as UTF-8 before a MySQL query

Posted 2024-04-25 22:52:05

I'm a bit stuck on whether I'm doing this correctly.

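I have a file that is ISO-8859-1 (fairly sure). My MySQL database uses UTF-8 encoding. That is why I want to convert the file to UTF-8 encoded characters before sending it as a query. For example, I first rewrite each line of file.txt into file_new.txt using: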

line = line.decode('ISO-8859-1').encode('utf-8')

Then I save it. Next, I create a MySQL connection and a cursor, and run the following query so that all data is received as UTF-8:

^{pr2}$

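After that, I reopen file_new.txt and insert each line into MySQL. Is this the right way to get a table populated with UTF-8 encoded data in MySQL? Or am I missing some important part?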
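The `line.decode(...).encode(...)` idiom in the question is Python 2, where `str` holds bytes. A minimal Python 3 sketch of the same per-line conversion, assuming each line is read as `bytes`; the helper name `latin1_to_utf8` is made up for illustration and is not from the question:

```python
def latin1_to_utf8(raw_line: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text, then serialize that text as UTF-8."""
    # ISO-8859-1 assigns a character to every byte value, so this never raises.
    text = raw_line.decode("iso-8859-1")
    return text.encode("utf-8")


# 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
converted = latin1_to_utf8(b"caf\xe9\n")
print(converted)
```

Because ISO-8859-1 decoding accepts any byte sequence, running this on input that is already UTF-8 would not fail; it would silently double-encode the text. That is one reason to confirm the source encoding first.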

Now for retrieving the data. I also use 'SET NAMES "utf8"' there. But the text does not display correctly when I set the Content-Type header to

header("Content-Type: text/html; charset=utf-8");

On the other hand, when I use

header("Content-Type: text/html; charset=ISO-8859-1");

it works fine, but other UTF-8 encoded data from the database gets scrambled. So my guess is that file.txt is still not encoded as UTF-8. Can anyone explain why?

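PS: Before reading every line, I replace one character and save file.txt as file.txt.tmp. I then read that file to produce file_new.txt. I don't know whether this causes any problem with the original file's encoding.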

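import codecs

# Read as ISO-8859-1, write as UTF-8, replacing double quotes with single quotes.
# The with-statement closes both files even if an error occurs.
with codecs.open(tsvpath, 'r', encoding='iso-8859-1') as f1, \
        codecs.open(tsvpath + '.tmp', 'w', encoding='utf-8') as f2:
    for line in f1:
        f2.write(line.replace('"', "'"))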

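In the example below, the Persian data that I encoded as UTF-8 is correct, but other non-English text shows up as question marks. That is exactly my problem.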

Example: removed.
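The two symptoms described in the question (scrambled text under one charset header, question marks under the other) are what a mismatch between the stored bytes and the declared charset tends to produce. A small sketch of both directions, using one sample character chosen for illustration; how a particular browser renders the invalid bytes is an assumption here:

```python
original = "é"

# Correct UTF-8 bytes, but read as ISO-8859-1: each byte becomes its own character.
utf8_bytes = original.encode("utf-8")            # b'\xc3\xa9'
shown_as_latin1 = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes, but read as UTF-8: the byte sequence is invalid, and a
# lenient decoder substitutes U+FFFD, which is often drawn as a question mark.
latin1_bytes = original.encode("iso-8859-1")     # b'\xe9'
shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(repr(shown_as_latin1), repr(shown_as_utf8))
```

If a page mixes data stored in both encodings, no single charset header can display all of it correctly, which matches the behaviour reported above.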

3 answers

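OK guys, so my encoding was correct. The file is being converted to UTF-8 as required. Everything in the question was correct. It turned out that the other dataset, the Arabic one, was in ISO-8859-1. So only one of the two was ever going to work, no matter what I did.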

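Hex editors helped a lot. In the end, though, I just used Sublime Text to re-check whether my encoded data was UTF-8. It turned out that the Python script and the Sublime editor produce the same result. So the code is fine. :)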

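Welcome to the wonderful world of Unicode and Windows. I found this site very helpful for understanding what went wrong with my strings: http://www.i18nqa.com/debug/utf8-debug.html. Another thing you need is a hex editor such as HxD. There are many places where things can go wrong. For example, if you view the file in a text editor, it may try to be helpful and silently change your encoding.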

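Start with your original data: look at it in HxD and see what the encoding is. Then look at your result in HxD and check whether the changes you expect are being made. Repeat this for each step in your process.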
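The same byte-level check can be done from Python when a hex editor is not at hand. This is only a sketch; `hex_dump` and `is_valid_utf8` are hypothetical helper names, not part of any library:

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "é" stored as ISO-8859-1 is the single byte e9, which is not valid UTF-8.
sample = "é".encode("iso-8859-1")
print(hex_dump(sample), is_valid_utf8(sample))
```

A successful strict decode does not prove the data was written as UTF-8 (pure ASCII is valid in both encodings), but a failure is strong evidence that it was not.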

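Without the complete code and sample data, it's hard to say where the problem is. My guess is that replacing double quotes with single quotes in a binary file is the culprit.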

Also see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Try this:

line = line.decode('ISO-8859-1').encode('utf-8-sig')

From the documentation:

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.

Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, INVERTED QUESTION MARK in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

Source: https://docs.python.org/3.5/library/codecs.html

Edit:

Example: "Hello World".encode('utf-8') produces b'Hello World', while "Hello World".encode('utf-8-sig') produces {}. To highlight the relevant part of the documentation:

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
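The behaviour quoted from the documentation can be checked directly in Python 3. A short sketch using only the standard `utf-8` and `utf-8-sig` codecs:

```python
plain = "Hello World".encode("utf-8")
signed = "Hello World".encode("utf-8-sig")

print(plain)
print(signed)

# Decoding with utf-8-sig drops the three signature bytes;
# decoding with plain utf-8 keeps them as a leading U+FEFF character.
stripped = signed.decode("utf-8-sig")
kept = signed.decode("utf-8")
print(repr(stripped), repr(kept))
```

The only difference between the two encodings is the three leading bytes 0xef, 0xbb, 0xbf; the text that follows them is encoded identically.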

Edit: Some time ago I wrote a similar function that converts a file to UTF-8 encoding. Here is a snippet:

^{pr2}$

Based on your example, try the following:

convert_encoding('file.txt.tmp', 'file_new.txt')
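The answerer's own `convert_encoding` snippet is not preserved here, so its real signature and behaviour are unknown. Purely as an illustration of what such a helper might look like in Python 3, here is a sketch; the parameter names and the default encodings are assumptions, not the original code:

```python
def convert_encoding(src_path, dst_path,
                     src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Copy src_path to dst_path, transcoding the text line by line.

    Hypothetical helper: the defaults are assumptions for illustration.
    """
    # newline="" turns off newline translation, so line endings pass through unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

With these assumed defaults, the call shown above would read file.txt.tmp as ISO-8859-1 and write file_new.txt as UTF-8. If the intermediate file was already written as UTF-8, the source encoding would need to be passed as "utf-8" to avoid double-encoding.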
