Unicode decode error when processing files in Python

1 vote
1 answer
540 views
Asked 2025-04-18 04:15

I'm having trouble with decoding. Other posts show how to handle simple strings, for example with u'string'.encode, but I can't find an approach that works for files.

Any help would be greatly appreciated!

Here is my code.

text = file.read()
text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
file.seek(0)  # rewind
file.write(text.encode('utf-8'))

Here is the full code, in case it helps.

#!/usr/bin/env python
# coding: utf-8

"""
 Script to help translate some of the code's method names from
 Portuguese to English.
"""

from multiprocessing import Pool
from mock import MagicMock
from goslate import Goslate
import fnmatch
import logging
import os
import re
import urllib2

_MAX_PEERS = 1
try:
    os.remove('traducoes.log')
except OSError:
    pass
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.FileHandler('traducoes.log')
logger.addHandler(handler)


def fileWalker(ext, dirname, names):
    """
    Find the files with the correct extension
    """
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f, pat):
            ext[1].append(os.path.join(dirname, f))


def encontre_text(file):
    """
    Find the words in the text that contain '_'.
    """
    text = file.read().decode('utf-8')
    return re.findall(r"\w+(?<=_)\w+", text)
    #return re.findall(r"\"\w+\"", text)


def traduza_palavra(txt):
    """
        Translate the word/phrase to english
    """
    try:
        # try connect with google
        response = urllib2.urlopen('http://google.com', timeout=2)
        pass
    except urllib2.URLError as err:
        print "No network connection "
        exit(-1)
    if txt[0] != '_':
        txt = txt.replace('_', ' ')
    txt = txt.replace('media'.decode('utf-8'), 'média'.decode('utf-8'))
    gs = Goslate()
    #txt = gs.translate(txt, 'en', gs.detect(txt))
    txt = gs.translate(txt, 'en', 'pt-br')  # force Brazilian Portuguese as the source language
    txt = txt.replace(' en ', ' br ')
    return txt.replace(' ', '_')  # .lower()


def subistitua(file, txt, novo_txt):
    """
    should rewrite the file with the new text in the future
    """
    text = file.read()
    text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
    file.seek(0)  # rewind
    file.write(text.encode('utf-8'))


def magica(File):
    """
    Pool worker. Each worker handles one element
    from the list of files.
    """
    global _DONE
    if _MAX_PEERS == 1:  # not feasible with multiple workers
        logger.info('\n---- File %s' % File)
    with open(File, "r+") as file:
        list_txt = encontre_text(file)
        for txt in list_txt:
            novo_txt = traduza_palavra(txt)
            if txt != novo_txt:
                logger.info('%s -> %s [%s]' % (txt, novo_txt, File))
            subistitua(file, txt, novo_txt)
        file.close()
    print File.ljust(70) + '[OK]'.rjust(5)

if __name__ == '__main__':
    try:
        response = urllib2.urlopen('http://www.google.com.br', timeout=1)
    except urllib2.URLError as err:
        print "No network connection "
        exit(-1)
    root = './app'
    ex = ".py"
    files = []
    os.path.walk(root, fileWalker, [ex, files])

    print '%d files found to be translated' % len(files)
    try:
        if _MAX_PEERS > 1:
            _pool = Pool(processes=_MAX_PEERS)
            result = _pool.map_async(magica, files)
            result.wait()
        else:
            result = MagicMock()
            result.successful.return_value = False
            for f in files:
                pass
                magica(f)
            result.successful.return_value = True
    except AssertionError, e:
        print e
    else:
        pass
    finally:
        if result.successful():
            print 'Translated all files'
        else:
            print 'Some files were not translated'

Thanks, everyone, for your help!

1 Answer

1

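In Python 2, reading from a file gives you plain (byte) string objects, not Unicode objects. You don't need to call .encode() on these strings. Doing so only triggers an implicit decode to Unicode first, using the default ASCII codec, and that step fails as soon as the data contains non-ASCII bytes.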
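To make the failure concrete, here is a small sketch of what that implicit step amounts to. The explicit decode('ascii') call stands in for the hidden conversion Python 2 performs, and the sample word is only an illustration.

```python
# UTF-8 bytes for u'média', as Python 2's file.read() would return them.
raw = u'm\xe9dia'.encode('utf-8')

# Calling raw.encode('utf-8') in Python 2 first runs the equivalent of
# raw.decode('ascii') behind the scenes; here that step is made explicit.
try:
    raw.decode('ascii')
    implicit_step_failed = False
except UnicodeDecodeError:
    implicit_step_failed = True

# Decoding with the file's actual encoding works.
text = raw.decode('utf-8')
```

The byte string was already encoded, so the only conversion it ever needed was a decode with the right codec.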

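A simple rule: use the Unicode sandwich. Decode to Unicode as soon as you read data, use Unicode values everywhere inside your code, and encode only at the point where you write data back out. You can open files with io.open(), which handles the encoding and decoding for you.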
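The sandwich could look roughly like the sketch below. The file name and its contents are made up for the demonstration, and UTF-8 is assumed as the file encoding.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'sample.py')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'media_aritmetica = 1')

# Read: io.open() decodes the bytes, so read() returns a Unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Process: Unicode values only, with no .encode() or .decode() calls.
text = text.replace(u'media', u'm\xe9dia')

# Write: io.open() encodes the Unicode string back to UTF-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading in binary mode shows the encoded bytes that ended up on disk.
with io.open(path, 'rb') as f:
    raw = f.read()
```

Encoding and decoding happen only at the file boundary, so nothing in the middle has to know which codec the file uses.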

This also means you can use Unicode literals everywhere, including in regular expressions and string literals. So you can use:

def encontre_text(file):
    text = file.read()  # assume `io.open()` was used
    return re.findall(ur"\w+(?<=_)\w+", text)  # use a unicode pattern

and

def subistitua(file, txt, novo_txt):
    text = file.read()  # assume `io.open()` was used
    text = text.replace(txt, novo_txt)
    file.seek(0)  # rewind
    file.write(text)

because all string values in the program are already unicode, and

txt = txt.replace(u'media', u'média')

since u'..' Unicode string literals don't need to be decoded.
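Putting the pieces together, an in-place rewrite might be sketched as follows. The helper names find_names and replace_in_place and the sample file are hypothetical. The seek(0) before reading and the truncate() after writing are additions that go beyond the snippets above: the first is needed because an earlier read leaves the position at the end of the file, the second because the new text may be shorter than the old one.

```python
import io
import os
import re
import tempfile

# Same pattern as in the answer; re.UNICODE makes \w match accented
# letters on Python 2 (Python 3 does this by default for str patterns).
PATTERN = re.compile(u"\\w+(?<=_)\\w+", re.UNICODE)


def find_names(f):
    """Return the underscore-containing words in an open text file."""
    f.seek(0)
    return PATTERN.findall(f.read())


def replace_in_place(f, old, new):
    """Replace old with new in a file opened with io.open(path, 'r+')."""
    f.seek(0)      # an earlier read leaves the position at end of file
    text = f.read().replace(old, new)
    f.seek(0)      # rewind before writing
    f.write(text)
    f.truncate()   # drop leftover bytes if the new text is shorter


# Hypothetical sample file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'sample.py')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'm\xe9dia_aritm\xe9tica = 1')

with io.open(path, 'r+', encoding='utf-8') as f:
    names = find_names(f)
    for name in names:
        replace_in_place(f, name, u'mean')

with io.open(path, 'r', encoding='utf-8') as f:
    result = f.read()
```

Here the translated name is hard-coded as u'mean' to keep the sketch offline; in the original script it would come from traduza_palavra().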
