What is the default encoding in Python 2.7.8?

Posted 2024-04-25 21:51:07


When I open a file with codecs.open('f.txt', 'r', encoding=None), Python 2.7.8 chooses some default encoding.

Which one is it, and where is this documented?

Some experimentation shows that the default encoding is not utf-8, ascii, sys.getdefaultencoding(), locale.getpreferredencoding(), or {}.

Edit (to clarify my motivation): I want to know which encoding Python 2.7.8 chooses when I run a script like this:

import codecs

f = codecs.open('f.txt', 'r', encoding=None)  # or equivalently: f = open('f.txt')
for line in f:
    print len(line)  # obviously SOME encoding has been chosen if I can print the number of characters

I am not interested in other ways of guessing the file's encoding.


2 Answers

It basically does no transparent encoding or decoding at all; it just opens the file and returns it.

Here is the code from the library:

def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):

    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.
        Note: The wrapped version will only accept the object format
        defined by the codecs, i.e. Unicode objects for most builtin
        codecs. Output is also codec dependent and will usually be
        Unicode as well.
        Files are always opened in binary mode, even if no binary mode
        was specified. This is done to avoid data loss due to encodings
        using 8-bit values. The default file mode is 'rb' meaning to
        open the file in binary read mode.
        encoding specifies the encoding which is to be used for the
        file.
        errors may be given to define the error handling. It defaults
        to 'strict' which causes ValueErrors to be raised in case an
        encoding error occurs.
        buffering has the same meaning as for the builtin open() API.
        It defaults to line buffered.
        The returned wrapped file object provides an extra attribute
        .encoding which allows querying the used encoding. This
        attribute is only available if an encoding was specified as
        parameter.
    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw

As you can see, if encoding is None, it simply returns the opened file.

Here is your file, with each byte shown in decimal alongside its corresponding ASCII character:

[byte listing not preserved]
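A listing of this kind could be produced with something like the sketch below. This is not necessarily how the answerer generated theirs: dump_bytes is a hypothetical helper, and the sample bytes are made up for illustration rather than taken from the asker's file.

```python
def dump_bytes(data):
    """Return (decimal value, printable character) pairs for each byte."""
    rows = []
    # bytearray yields integers on both Python 2 and Python 3
    for value in bytearray(data):
        # Show printable ASCII as-is and everything else as a dot
        char = chr(value) if 32 <= value < 127 else '.'
        rows.append((value, char))
    return rows


sample = b'30\xb4"]'  # hypothetical bytes, not the asker's actual file
for value, char in dump_bytes(sample):
    print('%3d  %s' % (value, char))
```

Any value above 127 in the first column marks a byte that cannot be plain ASCII.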

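The problem you hit when opening it as ASCII is the byte with decimal value 180. ASCII only goes up to 127, so this made me think the file must use some kind of extended ASCII, where 128-255 are used for extra symbols. After reading the Wikipedia article on ASCII (https://en.wikipedia.org/wiki/ASCII), I saw that it mentions a popular ASCII extension, windows-1252. In windows-1252, decimal value 180 maps to the acute accent (´). I then searched for the string in your file to see what it actually refers to. This is what I found: "Harvard Cup 30´" http://www.365chess.com/tournaments/Harvard_Cup_30%C2%B4_1989/21650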

So, in summary, the correct encoding is probably windows-1252. Here is my test program:

import codecs
with codecs.open('f.txt', 'r', encoding='windows-1252') as f:
    print f.read()

Output:

... 0-1

[Event "Harvard Cup 30´"]
...
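The reasoning above can be checked without the file itself. The following is a minimal sketch, assuming a made-up line that contains byte 180; it shows that ASCII decoding fails, that windows-1252 decodes the byte to U+00B4, and that the %C2%B4 in the URL above is the UTF-8 encoding of that same character.

```python
raw = b'[Event "Harvard Cup 30\xb4"]'  # hypothetical line containing byte 180

try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # 180 is outside the 0-127 ASCII range

text = raw.decode('windows-1252')  # byte 180 becomes U+00B4 ACUTE ACCENT
accent = text[-3]                  # the character just before the closing '"]'

print(ascii_ok)
print(repr(accent))
print(accent.encode('utf-8') == b'\xc2\xb4')
```

This only shows that windows-1252 is consistent with the data; other single-byte encodings such as latin-1 also map 180 to the acute accent, so it does not prove which one the file's author used.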

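When reading a file, codecs.open('f.txt','r',encoding=None) returns byte strings, not Unicode strings. It makes no attempt to decode the file data with any encoding; it is equivalent to open('f.txt','r'). The length you get is the number of individual bytes stored in that line of the file, with no conversion applied.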

A small example:

>>> import codecs
>>> codecs.open('f.txt','r',encoding=None).read()
'abc\n'
>>> codecs.open('f.txt','r',encoding='ascii').read() # Note Unicode string returned.
u'abc\r\n'
>>> open('f.txt','r').read()
'abc\n'
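The difference between counting bytes and counting characters can be made visible with a line that contains a multi-byte character. The sketch below is an illustration under assumptions, not part of the original answer: it uses io.open rather than codecs.open so that it behaves the same on Python 2 and Python 3, and the sample line and temporary file are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical sample: one line containing U+00B4, which takes 2 bytes in UTF-8.
sample = u'Harvard Cup 30\xb4\n'.encode('utf-8')

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, 'wb') as f:
        f.write(sample)

    # Binary mode: nothing is decoded, so len() counts bytes.
    with io.open(path, 'rb') as f:
        byte_lengths = [len(line) for line in f]

    # Explicit encoding: each line is decoded, so len() counts characters.
    with io.open(path, 'r', encoding='utf-8') as f:
        char_lengths = [len(line) for line in f]
finally:
    os.remove(path)

print(byte_lengths)
print(char_lengths)
```

The two lists differ by one because the acute accent occupies two bytes but is a single character. The script in the question prints byte counts of the first kind.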
