Python encoding UTF-8

52 votes

2 answers

433887 views

数据工程师

Asked 2025-04-17 17:11

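I'm writing some scripts in Python. I build a string and save it to a file. The string contains a lot of data taken from the tree structure and file names of a directory. According to convmv, my whole tree is encoded in UTF-8.

I want to keep everything in UTF-8 because I will store it in a MySQL database afterwards. At the moment, in MySQL (which is also UTF-8), I have problems with some characters, such as the French é and è.

I want Python to always treat strings as UTF-8. I read some information on the internet and did the following.

My script begins like this: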

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 def createIndex():
     import codecs
     toUtf8=codecs.getencoder('UTF8')
     #lot of operations & building indexSTR the string who matter
     findex=open('config/index/music_vibration_'+date+'.index','a')
     findex.write(codecs.BOM_UTF8)
     findex.write(toUtf8(indexSTR)) #this bugs!

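But when I run it, I get this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I found that the accented characters are written correctly in my file. After creating the file, I read it and write its contents to MySQL. But I don't understand why there is an encoding problem. My MySQL database is utf8; at least, the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this: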

#!/usr/bin/python
# -*- coding: utf-8 -*-

def saveIndex(index,date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost','admin','*******','music_vibration')
    sql.charset="utf8"
    findex=open('config/index/'+index,'r')
    lines=findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste=line.split('[:::]')
            artiste=artiste[1].replace('\n','')

            c=sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
            nbr=c.fetchone()
            if nbr[0]==0:
                c=sql.cursor()
                iArt+=1
                c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8'))

The artist names display correctly in the file but are written incorrectly to the database. What is the problem?

database, mysql, string handling, character encoding, utf-8, encoding issues, file I/O, French characters

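2 Answers

Unfortunately, string.encode() is not always reliable. For more information, see this discussion: "What is a reliable way to convert a string (UTF-8 or otherwise) to a plain ASCII string in Python?"

Answered 2025-04-17 by Python大师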


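You don't need to encode data that is already encoded. When you try, Python 2 first decodes it to unicode with the default ASCII codec so that it can encode it back to UTF-8. That implicit decode is what fails here: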

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode data that is already encoded.
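To make the hidden step visible, here is a minimal sketch that spells out what Python 2 does when .encode() is called on a byte string. The helper name reencode_like_python2 is made up for illustration, and the snippet is written so that it also runs on Python 3, where bytes objects have no .encode() method at all.

```python
# -*- coding: utf-8 -*-

def reencode_like_python2(raw_bytes):
    """Mimic Python 2's str.encode('utf8'): implicit ASCII decode, then encode."""
    text = raw_bytes.decode('ascii')   # the hidden step; fails on non-ASCII bytes
    return text.encode('utf8')


encoded = u'\u00e9'.encode('utf8')     # e-acute as UTF-8 bytes: 0xC3 0xA9

try:
    reencode_like_python2(encoded)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Pure ASCII bytes survive the round trip, which is why the problem stays
# hidden until the first accented character shows up.
ascii_roundtrip = reencode_like_python2(b'abc')
```

This is why the script in the question can appear to work until position 2171, where the first non-ASCII byte sits.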

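If you build up unicode values instead, you do need to encode them before writing them to a file. In that case use codecs.open(), which returns a file object that encodes unicode values to UTF-8 for you.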
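As a rough illustration of that advice, the sketch below writes a unicode value through codecs.open() and then reads the raw bytes back. The sample line and the temporary file path are assumptions made for the demo, not the asker's real data.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

index_text = u'#artiste[:::]Am\u00e9lie\n'   # unicode text, not yet encoded

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.index')

# codecs.open() encodes unicode values to UTF-8 on write.
with codecs.open(path, 'a', 'utf8') as findex:
    findex.write(index_text)

# Reading the file back in binary mode shows the encoded bytes on disk.
with open(path, 'rb') as raw:
    on_disk = raw.read()
```

No explicit .encode() call and no encoder from codecs.getencoder() is needed; the file object does the conversion.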

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot otherwise read UTF-8 (such as MS Notepad).
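For reference, a small sketch of what the BOM looks like in practice. It only uses the standard utf-8 and utf-8-sig codecs; the sample text is arbitrary. Note that a script which opens its file in append mode and writes codecs.BOM_UTF8 on every run would end up with BOM bytes in the middle of the file.

```python
# -*- coding: utf-8 -*-
import codecs

text = u'\u00e9t\u00e9'

plain = text.encode('utf-8')          # no BOM
with_bom = text.encode('utf-8-sig')   # BOM prepended by the codec

# A reader that uses plain 'utf-8' keeps the BOM as a U+FEFF character...
decoded_naively = with_bom.decode('utf-8')
# ...while 'utf-8-sig' strips a BOM if present and is harmless if absent.
decoded_sig = with_bom.decode('utf-8-sig')
decoded_sig_no_bom = plain.decode('utf-8-sig')
```

If a BOM-prefixed file has to be read later, decoding with utf-8-sig avoids a stray U+FEFF at the start of the first line.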

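For your MySQL insert problem, you need to do two things:

Add charset='utf8' to your MySQLdb.connect() call.

Use unicode objects, not str objects, when querying or inserting, but use SQL parameters so the MySQL connector can do the right thing for you: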

artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

# ...

c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
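The same pattern can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module purely as a stand-in, so the placeholder is ? rather than the %s that MySQLdb uses above. The table layout and the helper name insert_if_missing are assumptions based on the column names in the question.

```python
# -*- coding: utf-8 -*-
import sqlite3

# Stand-in database: sqlite3 instead of MySQLdb, so '?' replaces '%s'.
conn = sqlite3.connect(':memory:')
# Assumed schema; only the column names come from the question.
conn.execute('CREATE TABLE artistes '
             '(id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)')


def insert_if_missing(conn, artiste):
    """Insert artiste unless a row with that name exists; return True if inserted."""
    cur = conn.cursor()
    cur.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=?', (artiste,))
    if cur.fetchone()[0]:
        return False
    cur.execute('INSERT INTO artistes(nom,status,path) VALUES(?, 99, ?)',
                (artiste, artiste + u'/'))
    return True


# Accents and quote characters pass through unchanged, with no manual escaping.
names = [u'C\u00e9line', u"Guns N' Roses", u'C\u00e9line', u'x" OR "1"="1']
inserted = [insert_if_missing(conn, name) for name in names]
rows = conn.execute('SELECT nom, path FROM artistes ORDER BY id').fetchall()
```

Because the values travel separately from the SQL text, a name containing a double quote cannot break out of the statement, which the string concatenation in the question would allow.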

It may work better if you use codecs.open() to decode the contents automatically instead:

import codecs
import MySQLdb as mdb

sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
artists_inserted = 0

with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        artiste=line.split(u'[:::]')[1].strip()

        # Look up and insert inside the loop, so every artist line is handled.
        cursor = sql.cursor()
        cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
        if not cursor.fetchone()[0]:
            cursor = sql.cursor()
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
            artists_inserted += 1
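The parsing half of that loop can be checked on its own, without a database or a real index file. This is only a sketch: the generator name iter_artists and the sample lines are assumptions, since the question shows the '#artiste' marker and the '[:::]' separator but not the full file layout.

```python
# -*- coding: utf-8 -*-
import io


def iter_artists(lines):
    """Yield the artist name from each line that carries the '#artiste' marker."""
    for line in lines:
        if u'#artiste' not in line:
            continue
        yield line.split(u'[:::]')[1].strip()


# Assumed sample content standing in for an already-decoded index file.
sample = io.StringIO(u'#album[:::]Mistral gagnant\n'
                     u'#artiste[:::]Renaud\n'
                     u'#artiste[:::]Myl\u00e8ne Farmer\n')

artists = list(iter_artists(sample))
```

Any iterable of unicode lines works as input, including the file object returned by codecs.open().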

You may want to read up on Unicode, UTF-8, and encodings.

Answered 2025-04-17 by Python大师
