Decoding ISO-8859-1 and encoding as UTF-8 before a MySQL query

3 answers

User

#1 · edited 2024-04-25 22:52:05

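OK guys, so my encoding was correct. The file is being converted to UTF-8 as needed. The queries were all fine too. It turned out that the other Arabic dataset was in ISO-8859-1, which is why only one of the two was working, no matter what I did.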

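Hex editors helped a lot. In the end, though, I just used Sublime Text to recheck whether my encoded data was UTF-8. It turned out the Python script and the Sublime editor were doing the same thing, so the code was fine. :)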
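
For anyone who wants to do the same check without an editor, here is a minimal sketch in Python 3. The helper name `is_valid_utf8` is made up for illustration; it only relies on the fact that a strict UTF-8 decode raises `UnicodeDecodeError` on byte sequences that are not valid UTF-8. Note that a successful decode does not prove the data was meant to be UTF-8, it only shows that it could be.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Arabic letter alef (U+0627) encoded as UTF-8 is the two bytes 0xD8 0xA7.
utf8_bytes = "\u0627".encode("utf-8")

# U+00E9 encoded as ISO-8859-1 is the single byte 0xE9,
# which is an incomplete sequence when read as UTF-8.
latin1_bytes = "\u00e9".encode("iso-8859-1")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper.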

User

#2 · edited 2024-04-25 22:52:05

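Welcome to the wonderful world of Unicode and Windows. I found http://www.i18nqa.com/debug/utf8-debug.html very helpful for understanding what was wrong with my strings. Another thing you will need is a hex editor such as HxD. There are many places where things can go wrong. For example, if you view the file in a text editor, it may try to be helpful and silently change your encoding.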
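
To give an idea of the kind of damage that debugging chart catalogues, here is a small sketch of what happens when UTF-8 bytes are read as ISO-8859-1. The variable names are only illustrative; the point is that the misreading is reversible as long as nothing else has touched the text in between.

```python
original = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Correct UTF-8 encoding: two bytes, 0xC3 0xA9.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 yields two characters instead of one:
# U+00C3 (A with tilde) followed by U+00A9 (copyright sign).
misread = utf8_bytes.decode("iso-8859-1")

# Undo the mistake: back to the raw bytes, then decode with the right codec.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(utf8_bytes)
print(len(misread))
print(repaired == original)
```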

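Start with your raw data: open it in HxD and see what the encoding actually is. Then open your output in HxD and check whether the change you expect is really happening. Repeat this for each step in the process.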
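
If you do not have a hex editor at hand, the same inspection can be roughly approximated in Python 3. This is only a sketch; `hex_preview` is a hypothetical helper, not part of any library.

```python
def hex_preview(data: bytes, limit: int = 16) -> str:
    """Format the first `limit` bytes as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in data[:limit])


# Sample data: U+00E9 written with the utf-8-sig codec,
# i.e. a three-byte BOM followed by the two-byte UTF-8 sequence.
sample = "\u00e9".encode("utf-8-sig")

print(hex_preview(sample))  # ef bb bf c3 a9
```

To look at a file, read it in binary mode first, for example `hex_preview(open(path, "rb").read())`, so that Python does not decode anything behind your back.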

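Without the complete code and sample data, it is hard to say where the problem is. My guess is that replacing double quotes with single quotes in a binary file is the culprit.

Also read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.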

User

#3 · edited 2024-04-25 22:52:05

Try this:

line = line.decode('ISO-8859-1').encode('utf-8-sig')

From the documentation:

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, INVERTED QUESTION MARK in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

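Source: https://docs.python.org/3.5/library/codecs.html

Edit:

Sample: "Hello World".encode('utf-8') produces b'Hello World', while "Hello World".encode('utf-8-sig') produces b'\xef\xbb\xbfHello World', which highlights this part of the documentation: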

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
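
The decoding half of that quote can be checked the same way. A short Python 3 sketch, with variable names chosen only for illustration:

```python
with_bom = "Hello World".encode("utf-8-sig")
plain = "Hello World".encode("utf-8")

# The signature is exactly three bytes in front of the ordinary UTF-8 data.
bom = with_bom[:3]
rest = with_bom[3:]

# utf-8-sig drops the signature on decoding ...
decoded_sig = with_bom.decode("utf-8-sig")

# ... while plain utf-8 keeps it as a leading U+FEFF character.
decoded_plain = with_bom.decode("utf-8")

# utf-8-sig also accepts input that has no signature at all.
decoded_no_bom = plain.decode("utf-8-sig")

print(bom == b"\xef\xbb\xbf", rest == plain)
print(decoded_sig == "Hello World")
print(decoded_plain == "\ufeffHello World")
print(decoded_no_bom == "Hello World")
```

So reading with `utf-8-sig` is the safer choice when you do not know whether the input carries a BOM, while writing with it adds three bytes that some consumers may not expect.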

Edit: A while ago I wrote a similar function that converts a file to UTF-8 encoding. Here is a snippet:

[code snippet missing in the source]

Based on your example, try the following:

convert_encoding('file.txt.tmp', 'file_new.txt')
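
The body of `convert_encoding` is not shown here, so the following is only a guess at what a function with that name and call shape might do, written for Python 3. The `src_encoding` and `dst_encoding` parameters and their defaults are assumptions, as is the line-by-line copy; it is a sketch, not the function from this answer.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path,
                     src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Re-encode a text file (hypothetical sketch, parameters are assumptions)."""
    # newline="" disables newline translation, so line endings are copied as-is.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:  # stream line by line to keep memory use low
            dst.write(line)


# Usage with a throwaway ISO-8859-1 file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_file = os.path.join(tmp_dir, "file.txt.tmp")
    dst_file = os.path.join(tmp_dir, "file_new.txt")

    with open(src_file, "wb") as handle:
        handle.write("caf\u00e9\n".encode("iso-8859-1"))  # b'caf\xe9\n'

    convert_encoding(src_file, dst_file)

    with open(dst_file, "rb") as handle:
        converted = handle.read()

print(converted)  # b'caf\xc3\xa9\n'
```

Pass `dst_encoding="utf-8-sig"` if the consumer of the file expects the BOM signature described above.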
