Python string equality test gives inconsistent results
The following function is used in a script that creates a static version of a Django site:
    def write_file(filename, content):
        filename = '{0}{1}.html'.format(BASEDIR, filename)
        if os.path.exists(filename):
            existing_file = io.open(filename, encoding='utf-8')
            existing_content = existing_file.read()
            existing_file.close()
            if existing_content != content:
                print "Content is not equal, writing file to {0}".format(filename)
                encoded_content = content.encode('utf-8')
                html_file = open(filename, 'w')
                html_file.write(encoded_content)
                html_file.close()
            else:
                print "Content is equal, nothing is written to {0}".format(filename)
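When I run the script twice (with no changes to the database in between), I would expect the second run to write nothing at all. Strangely, more than half of the files are written again on every run.
2 Answers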
0
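What you describe is most likely an encoding problem: somewhere along the way the data is being encoded twice, or encoded text is being compared with unicode text. In Python 2.x, `'abc' == u'abc'` is true, so the files that contain only ASCII characters pass the comparison, while the files that contain non-ASCII characters are not the same before and after UTF-8 encoding.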
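The ASCII versus non-ASCII split described above can be illustrated with a small sketch. It assumes that the mismatch comes from UTF-8 bytes being implicitly decoded as ASCII, which is what Python 2 does when a byte string is compared with a unicode string; the helper name `survives_ascii_coercion` is made up for this illustration.

```python
def survives_ascii_coercion(text):
    """Return True if the UTF-8 bytes of `text` still equal `text`
    after being decoded as ASCII (hypothetical helper, for illustration)."""
    try:
        return text.encode("utf-8").decode("ascii") == text
    except UnicodeDecodeError:
        # Non-ASCII characters become multi-byte sequences that ASCII cannot decode.
        return False


ascii_only = survives_ascii_coercion(u"abc")      # pure ASCII content
non_ascii = survives_ascii_coercion(u"caf\u00e9")  # contains a non-ASCII character
```

Only the pure-ASCII string compares equal after the round trip, which would match the observation that some files are left alone and others are rewritten every time.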
The easiest way to find out what is going on is to improve the error reporting in your code. Right after the else statement, add:
    print repr(existing_content), repr(content)
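For long HTML pages the two `repr` outputs can be hard to compare by eye. A possible extension, sketched here with a hypothetical helper named `describe_difference`, reports a type mismatch or the first position where the two values differ:

```python
def describe_difference(existing_content, content):
    """Summarise how two strings differ (hypothetical debugging helper)."""
    # A type mismatch (byte string vs. text) is the first thing to rule out.
    if type(existing_content) is not type(content):
        return "type mismatch: {0} vs {1}".format(
            type(existing_content).__name__, type(content).__name__)
    # Walk both values in parallel and stop at the first differing element.
    for index, (old, new) in enumerate(zip(existing_content, content)):
        if old != new:
            return "first difference at index {0}: {1!r} vs {2!r}".format(
                index, old, new)
    # One value may simply be a prefix of the other.
    if len(existing_content) != len(content):
        return "lengths differ: {0} vs {1}".format(
            len(existing_content), len(content))
    return "identical"
```

Printing the result of this helper next to the file name should narrow the problem down to either the types involved or a specific character.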
0
I suggest using the codecs module; you could do it like this:
    import codecs
    import os

    def write_file(filename, content):
        filename = "{0}{1}.html".format(BASEDIR, filename)
        if os.path.exists(filename):
            # Open the file and decode it from UTF-8 into a unicode string.
            # The with block closes the file even if read() raises.
            with codecs.open(filename, "r", "utf-8") as existing_file:
                existing_content = existing_file.read()
            # Compare unicode with unicode; content is assumed to be a unicode string.
            if existing_content != content:
                print "Content is not equal, writing file to {0}".format(filename)
                # codecs handles the UTF-8 conversion when writing, so there is no need to encode content.
                with codecs.open(filename, "w", "utf-8") as html_file:
                    html_file.write(content)
                # If a UTF-8 byte order mark is required, use the "utf-8-sig" codec for
                # both reading and writing so the BOM never takes part in the comparison.
            else:
                print "Content is equal, nothing is written to {0}".format(filename)
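As a variation on the same idea, here is a minimal sketch that runs unchanged on Python 2 and Python 3. It assumes `content` is always a text (unicode) string, uses `io.open` instead of `codecs.open`, and passes `newline=""` so that line endings are neither translated on read nor on write. The function name `write_if_changed` and the demo path are made up for this example, and unlike the function in the question it also writes the file when it does not exist yet.

```python
import io
import os
import tempfile


def write_if_changed(path, content):
    """Write `content` (text) to `path` as UTF-8 only if it differs.

    Returns True when the file was written, False when it was left alone.
    """
    if os.path.exists(path):
        # Decode the existing file to text so that text is compared with text.
        with io.open(path, encoding="utf-8", newline="") as existing_file:
            if existing_file.read() == content:
                return False
    # Let io.open do the encoding; content must be text, not bytes.
    with io.open(path, "w", encoding="utf-8", newline="") as html_file:
        html_file.write(content)
    return True


# Usage: the second call finds identical content and skips the write.
demo_path = os.path.join(tempfile.mkdtemp(), "page.html")
first_write = write_if_changed(demo_path, u"caf\u00e9\r\n")
second_write = write_if_changed(demo_path, u"caf\u00e9\r\n")
```

Keeping both sides of the comparison as decoded text, and leaving all encoding to the file object, avoids the mixed byte string and unicode comparison discussed above.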
There is a lot of useful information here: How to use UTF-8 with Python