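Text encoding in Python appears to be a long-standing problem (my own question: Searching text files' contents with various encodings with Python?, and others I have read: 1, 2). I have taken a stab at writing some code, shown below, to guess the encoding.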



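  1. Have a dependency-free snippet that I can use with a moderate degree of success
  2. Scan a local workstation for text-based log files in any encoding and identify the ones I am interested in based on their contents (which requires opening each file with the correct encoding)
  3. Get this task done.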
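As an illustration of goal 2, here is a minimal sketch of scanning a directory tree and flagging files by content. The function name `find_matching_files`, the `marker` string, and the `read_len` default are all hypothetical choices made for this example. It assumes that trying each candidate encoding in turn is acceptable, and it only samples the start of each file.

```python
import codecs
import os


def find_matching_files(root, marker, encodings, read_len=1000):
    """Yield (path, encoding) for files whose leading bytes contain marker.

    All names here are hypothetical. Only the first read_len bytes of each
    file are read, so large logs are never loaded into memory in full.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    raw = f.read(read_len)
            except (IOError, OSError):
                # unreadable file (permissions, locks): skip it
                continue
            for enc in encodings:
                try:
                    # an incremental decoder tolerates a multi-byte
                    # character that was cut off at the end of the sample
                    text = codecs.getincrementaldecoder(enc)().decode(raw)
                except UnicodeError:
                    continue
                if marker in text:
                    yield path, enc
                    break
```

The first encoding that both decodes the sample and reveals the marker wins, so the order of `encodings` matters. A file in an unlisted encoding is silently skipped.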


def guess_encoding_debug(file_path):
    """
    DEBUG - returns a list of 2-value tuples.
    Will return a list of all possible text encodings, each with a count of the
    number of chars read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """

    import codecs
    import string
    from operator import itemgetter

    READ_LEN = 1000
    #NOTE: the original list continued past this point but is cut off in the source
    ENCODINGS = ['ascii', 'cp1252', 'mac_roman', 'utf_8', 'utf_16', 'utf_16_le']

    #chars in the regular ascii printable set are BY FAR the most common
    #in most files written in English, so their presence suggests the file
    #was decoded correctly.
    nonsuspect_chars = string.printable

    #to be a list of 2 value tuples
    results = []

    for e in ENCODINGS:
        #some encodings will raise an exception on an incompatible file,
        #which means they are invalid for it, so use try to exclude them from results[]
        try:
            with codecs.open(file_path, 'r', e) as f:

                #sample from the beginning of the file
                data = f.read(READ_LEN)

                nonsuspect_sum = 0

                #count the number of printable ascii chars in the
                #READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)

                #if there are more chars than READ_LEN
                #the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
                    results.append((e, nonsuspect_sum))
        except UnicodeError:
            #the file could not be decoded with this encoding, so skip it
            pass

    #sort results descending based on nonsuspect_sum portion of
    #tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)

    return results

def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
             outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """

    results = guess_encoding_debug(file_path)

    #return the encoding string (second 0 index) from the first
    #result in descending list of encodings (first 0 index)
    return results[0][0]
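
To make the counting heuristic easier to experiment with, here is a minimal sketch that applies the same idea to bytes already in memory rather than to a file. `rank_encodings` is a hypothetical helper written for this example, not part of the code above. It also shows why ties are common: every ASCII-compatible encoding decodes plain ASCII bytes to the same characters, so the count alone cannot separate them.

```python
import string


def rank_encodings(raw, encodings):
    """Return (encoding, printable_count) pairs, highest count first.

    Hypothetical helper (an assumption, not the original code): it scores
    each candidate encoding by how many printable ASCII chars it decodes.
    """
    printable = set(string.printable)
    results = []
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeError:
            # these bytes are not valid in this encoding, so exclude it
            continue
        results.append((enc, sum(1 for ch in text if ch in printable)))
    # sorted() is stable: encodings with equal counts keep their input order
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# plain ASCII bytes: 'ascii' and 'utf_8' tie, 'utf_16_le' scores far lower
ascii_ranking = rank_encodings(b"plain ascii log\n", ["ascii", "utf_8", "utf_16_le"])

# a byte above 0x7F rules out 'ascii'; a lone 0xE9 is also invalid UTF-8
latin_ranking = rank_encodings(b"caf\xe9 log", ["ascii", "utf_8", "cp1252"])
```

In the first call the winner is decided only by list order, since the top two scores are equal. That suggests the order of the candidate list is effectively part of the guess.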



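  1. This script will be run on and off the network on multiple company computers with varying versions of Python, so the less complexity the better. When I say "company", I mean a small non-profit of social scientists.
  2. I am in charge of collecting logs from GPS data processing, but I am not the systems administrator. She is not a Python programmer, and the less of her time I take up the better.
  3. The Python installation generally available at my company is installed along with a GIS software package, and it is generally best left alone.
  4. My requirements are not too strict: I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents into memory to manipulate, append to, or rewrite them.
  5. It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see whether I could get it to work.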
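
On point 5, one signal the standard library does expose without extra dependencies is the byte order mark (BOM) constants in `codecs`. The sketch below is an assumption about how they might be used as a first check before any character counting. `sniff_bom` and the `_BOMS` table are hypothetical names, and a file without a BOM simply yields `None`.

```python
import codecs

# Longer marks come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# Caveat: UTF-16 LE text whose first character is NUL would look like UTF-32.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

The returned names are codecs that consume the BOM while decoding, so the mark does not end up in the decoded text.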

