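I know more about bicycle repair, chainsaw use and trench safety than I do about Python or text encoding; keep that in mind...
Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?, and others I've read: 1, 2). I've taken a crack at writing some code to guess the encoding, below.
In limited testing this code seems to work for my purposes*, without my having to know much about the first three bytes of a text file's encoding or about the situations where those bytes aren't informative.
*My purposes are:
Question: What are the pitfalls of using what I assume to be a klutzy method of comparing and counting characters, like I do below? Any input is greatly appreciated.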
def guess_encoding_debug(file_path):
    """
    DEBUG - returns a list of (encoding, count) tuples.
    Will return list of all possible text encodings with a count of the number of chars
    read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """
    import codecs
    import string
    from operator import itemgetter
    READ_LEN = 1000
    ENCODINGS = ['ascii', 'cp1252', 'mac_roman', 'utf_8', 'utf_16', 'utf_16_le',
                 'utf_16_be', 'utf_32', 'utf_32_le', 'utf_32_be']
    # chars in the regular ascii printable set are BY FAR the most common
    # in most files written in English, so their presence suggests the file
    # was decoded correctly.
    nonsuspect_chars = string.printable
    # to be a list of 2 value tuples: (encoding, nonsuspect_sum)
    results = []
    for e in ENCODINGS:
        # some encodings raise a decoding error on an incompatible file;
        # those encodings are invalid for it, so exclude them from results
        try:
            with codecs.open(file_path, 'r', e) as f:
                # sample from the beginning of the file
                data = f.read(READ_LEN)
                nonsuspect_sum = 0
                # count the number of printable ascii chars in the
                # READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)
                # if there are more chars than READ_LEN
                # the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
                    results.append((e, nonsuspect_sum))
        except UnicodeError:
            # only decoding failures are skipped; a missing or unreadable
            # file still raises instead of being silently ignored
            pass
    # sort results descending based on nonsuspect_sum portion of
    # tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)
    return results
def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
             outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """
    results = guess_encoding_debug(file_path)
    # results is sorted best-first: results[0] is the top (encoding, count)
    # tuple, and its element 0 is the encoding name
    return results[0][0]
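I'm assuming it would be slower than chardet, which I'm not very familiar with. It would also be less accurate. The way it is designed, any Roman-character-based language that uses accents, umlauts, etc. will not work, or at least not well. It will be hard to know when it fails. However, most text in English, including most programming code, is largely written with the characters in string.printable, which this code depends on.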
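To make the accent pitfall concrete, here is a minimal sketch of my own (not part of the code above). It scores one UTF-8 sample the same way the function does, by counting string.printable characters. The helper name printable_count and the sample text are assumptions chosen for illustration.

```python
import string

def printable_count(text):
    # Same scoring idea as guess_encoding_debug: count printable ASCII chars.
    return sum(text.count(ch) for ch in string.printable)

# A short UTF-8 sample containing accented characters (illustrative only).
raw = "café, naïve, über".encode("utf_8")

scores = {}
for enc in ("cp1252", "mac_roman", "utf_8"):
    # All three encodings decode these bytes without raising an error.
    scores[enc] = printable_count(raw.decode(enc))

print(scores)
```

All three encodings get the same score here, because the accented bytes never add to the count whichever way they are decoded. With a tie, the stable sort keeps the order of the ENCODINGS list, so a file like this would be reported as cp1252 even though it is UTF-8.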
将来可能会选择外部库,但现在我想避免使用外部库,因为:
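Probably the simplest way to find out how well your code works is to take the test suites of other existing libraries and use them as a base for your own comprehensive test suite. Then you will know whether your code works for all of those cases, and you can also test all of the cases you care about.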
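Before borrowing a full suite, a small round-trip harness can serve as a starting point. The sketch below is only an assumption about how such a test might look: round_trip_report, decodes_back and the stand-in guesser always_utf8 are hypothetical names, and in practice you would pass in your own guess_encoding instead.

```python
import os
import tempfile

def decodes_back(raw, encoding, text):
    # True if decoding raw with the guessed encoding reproduces the text.
    try:
        return raw.decode(encoding) == text
    except (UnicodeError, LookupError):
        return False

def round_trip_report(guess_func, text, encodings):
    """Write text in each encoding and record (guess, decoded_correctly)."""
    report = {}
    for enc in encodings:
        raw = text.encode(enc)
        fd, path = tempfile.mkstemp(suffix=".txt")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(raw)
            guess = guess_func(path)
            report[enc] = (guess, decodes_back(raw, guess, text))
        finally:
            os.remove(path)
    return report

def always_utf8(path):
    # Hypothetical stand-in guesser; replace with guess_encoding.
    return "utf_8"

report = round_trip_report(always_utf8, "plain English text\n",
                           ["ascii", "utf_8", "utf_16"])
print(report)
```

The check compares decoded text rather than encoding names, since plain ASCII content decodes correctly under several encodings, and a guess of utf_8 for an ASCII file is not really a failure.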