Python - Remove accents from all files in a folder
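I'm trying to strip all accented characters from every source-code file in a folder. I have managed to build the list of files, but normalizing them with unicodedata fails with this error:
Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte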
if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath,filename))
    for i in range(len(files)):
        f = files.pop() ;
        os.rename(f,f+'.BACK')
        with open(f,'w') as File:
            for line in open(f+'.BACK').readlines():
                try:
                    newLine = unicodedata.normalize('NFKD',unicode(line)).encode('ascii','ignore')
                    File.write(newLine)
                except UnicodeDecodeError:
                    nERROR +=1
                    print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i)
                    newLine = line
                    File.write(newLine)
2 Answers
1
You may need to specify the encoding when you call unicode(line), for example unicode(line, 'utf-8').
If you don't know which encoding to use, sys.getfilesystemencoding() might help.
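A related way to make the encoding explicit is to name it when the file is opened, so every line arrives already decoded. The sketch below uses io.open, which accepts an encoding argument in both Python 2 and Python 3; the temporary file and its contents are made up for the demonstration, and 'utf-8' is only an assumption about how the file was saved.

```python
import io
import os
import tempfile

# Hypothetical sample file, written as UTF-8 purely for this demonstration.
fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as fout:
    fout.write(u"# funci\u00f3n principal\n")

# Naming the encoding up front yields decoded text lines, so no implicit
# (and possibly wrong) default codec is involved.
with io.open(path, "r", encoding="utf-8") as fin:
    lines = fin.readlines()

os.remove(path)
print(lines)
```

If the file was actually saved in another encoding, the read still raises UnicodeDecodeError, so the encoding passed here has to match the file.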
4
It looks like the file may be encoded in cp1252:
In [18]: print('\xf3'.decode('cp1252'))
ó
unicode(line) fails because unicode tries to decode line with the utf-8 codec, hence the error UnicodeDecodeError: 'utf8' codec can't decode...
You could try decoding line as cp1252 first and, if that fails, fall back to utf-8:
if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath,filename))
    for i,f in enumerate(files):
        os.rename(f,f+'.BACK')
        with open(f,'w') as fout:
            with open(f+'.BACK','r') as fin:
                for line in fin:
                    try:
                        try:
                            line=line.decode('cp1252')
                        except UnicodeDecodeError:
                            line=line.decode('utf-8')
                            # If this still raises an UnicodeDecodeError, let the outer
                            # except block handle it
                        newLine = unicodedata.normalize('NFKD',line).encode('ascii','ignore')
                        fout.write(newLine)
                    except UnicodeDecodeError:
                        nERROR +=1
                        print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i)
                        newLine = line
                        fout.write(newLine)
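One caveat about the order of the two attempts: cp1252 assigns a character to almost every byte value, so decoding with it first rarely fails, and UTF-8 input can come out silently garbled instead of raising an error. Below is a minimal sketch in Python 3 syntax that tries strict UTF-8 first and only then falls back to cp1252. It assumes every file is in one of those two encodings; decode_line and strip_accents are hypothetical helper names, not part of the code above.

```python
import unicodedata

def decode_line(raw):
    """Decode bytes as UTF-8 if they are valid UTF-8, else fall back to cp1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps going on the few bytes cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")

def strip_accents(text):
    """Same NFKD + ASCII approach as above, returning text instead of bytes."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# The same word encoded both ways (sample data for illustration only).
utf8_line = "funci\u00f3n".encode("utf-8")
cp1252_line = "funci\u00f3n".encode("cp1252")

results = [strip_accents(decode_line(b)) for b in (utf8_line, cp1252_line)]
print(results)

# Decoding the UTF-8 bytes as cp1252 succeeds but yields two wrong characters.
garbled = utf8_line.decode("cp1252")
print(garbled)
```

Trying UTF-8 first works because text in a legacy single-byte encoding that contains accented characters is very unlikely to also be valid UTF-8.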
By the way, unicodedata.normalize('NFKD',line).encode('ascii','ignore') is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:
In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore')
Out[23]: ''
In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore')
Out[24]: ''
If that is a problem, you can use the unidecode module instead:
In [25]: import unidecode
In [28]: print(unidecode.unidecode(u'‘’“”ß'))
''""ss
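If installing a third-party module is not an option, a standard-library alternative is to decompose the text and drop only the combining marks, so characters without a decomposition (such as ß or curly quotes) are kept rather than deleted. This is a sketch in Python 3 syntax with a hypothetical function name; unlike unidecode, it does not guarantee pure ASCII output.

```python
import unicodedata

def remove_combining_marks(text):
    """Strip accents by removing combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a non-zero
    # combining class for marks such as the acute accent or the tilde.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "acentua\u00e7\u00e3o \u00df \u201cok\u201d"
cleaned = remove_combining_marks(sample)
print(cleaned)
```

Whether leaving ß and typographic quotes in place is acceptable depends on whether the files must end up strictly ASCII.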