Python: How do I convert Windows-1251 to Unicode?
I want to use Python to convert a file's contents from Windows-1251 (Cyrillic) to Unicode. I found a function, but it doesn't work.
#!/usr/bin/env python
import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword will start again with the second encoding
            # from the tuple and so on... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
What should I do?
Thank you.
3 Answers
0
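This is only a guess, since you didn't say what "doesn't work" actually means.
If the file is generated correctly but shows garbled characters, the application you are viewing it with may not recognize it as UTF-8. In that case you need to add a BOM (byte order mark) at the beginning of the file: the 3 bytes 0xEF, 0xBB, 0xBF (written as-is, not encoded).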
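One way to get those three bytes without writing them by hand is Python's built-in 'utf-8-sig' codec, which emits the BOM at the start of the output. The sketch below is only an illustration: the helper name write_utf8_with_bom, the temporary file path, and the sample text are made up for the example.

```python
import io
import os
import tempfile

def write_utf8_with_bom(path, text):
    # 'utf-8-sig' prepends the 3-byte BOM (EF BB BF) before the encoded text
    with io.open(path, 'w', encoding='utf-8-sig') as out:
        out.write(text)

# Hypothetical demo file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'bom_demo.txt')
write_utf8_with_bom(demo_path, u'\u041f\u0440\u0438\u0432\u0435\u0442')

# Read the raw bytes back to inspect the start of the file
with open(demo_path, 'rb') as raw:
    first_bytes = raw.read(3)
```

io.open is used here because it accepts an encoding argument on both Python 2.6+ and Python 3.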
0
If you open the file with the codecs module, it automatically converts the content to Unicode as you read from the file. For example:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
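This only makes sense if you are processing the file's data in Python. If you want to convert a file on disk from one encoding to another (which is what the script you posted tries to do), you have to specify a concrete target encoding, because you can't write a file as plain "Unicode".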
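To illustrate the point about needing a concrete encoding, here is a small in-memory sketch. The sample bytes are an assumption made up for the example (the word "Привет" as it would be stored in cp1251); the same decoded text produces different bytes depending on which encoding is chosen for output.

```python
# Sample bytes: the word "Привет" as stored in a Windows-1251 file (illustrative)
raw_cp1251 = b'\xcf\xf0\xe8\xe2\xe5\xf2'

# Decoding turns the bytes into an abstract Unicode string
text = raw_cp1251.decode('cp1251')

# Writing it back out requires picking a concrete encoding;
# each choice yields a different byte sequence for the same text
as_utf8 = text.encode('utf-8')
as_utf16 = text.encode('utf-16-le')
```

So "converting to Unicode" on disk really means choosing one of these byte representations, most commonly UTF-8.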
24
import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you are trying to do?