Apparently, Python strings are not "born equal"

##############################
#  TEST ON THE ANSI-coded    #
#           FILE             #
##############################
import os
file = open(os.getcwd() + '\\myAnsi.txt', 'r')
fileText = file.read()
file.close()

file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file.write(fileText)
file.close()

# A print statement here like:
# >> print(fileText)
# will raise an exception.
# But if you're typing this code in a python terminal,
# you can just write:
# >> fileText
# and get the content printed. In my case, it is the exact
# content of the file.
# PS: I use the native windows cmd.exe as my Python terminal ;-)

##############################
#  TEST ON THE Utf-coded     #
#           FILE             #
##############################
import os
file = open(os.getcwd() + '\\myUtf.txt', 'r')
fileText = file.read()
file.close()

file = open(os.getcwd() + '\\outputUtf.txt', 'w')
file.write(fileText)
file.close()

# A print statement here like:
# >> print(fileText)
# will just work fine (at least for me).

############# END OF TEST #############

/*
******************************************************************************
**
**  File        : LinkerScript.ld
**
**  Author      : Auto-generated by Ac6 System Workbench
**
**  Abstract    : Linker script for STM32F746NGHx Device from STM32F7 series
**
**  Target      : STMicroelectronics STM32
**
**  Distribution: The file is distributed “as is,” without any warranty
**                of any kind.
**
*****************************************************************************
** @attention
**
** <h2><center>© COPYRIGHT(c) 2014 Ac6</center></h2>
**
*****************************************************************************
*/

/* Entry Point */
/*ENTRY(Reset_Handler)*/
ENTRY(Default_Handler)

/* Highest address of the user mode stack */
_estack = 0x20050000;    /* end of RAM */

_Min_Heap_Size = 0;      /* required amount of heap */
_Min_Stack_Size = 0x400; /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 320K
  ROM (rx)  : ORIGIN = 0x8000000, LENGTH = 1024K
}

>>> print(fileText)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 357: character maps to <undefined>
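The traceback points at the console rather than at the file: print has to encode the string for the terminal, and the codec involved is cp850, which has no mapping for the left curly quote U+201C (the opening quote of “as is,” in the linker script). Below is a minimal sketch of that failure and of one possible workaround, assuming a cp850 console. The sample string is made up for illustration and is not the actual file content.

```python
# Hypothetical sample text containing the same curly quotes as the linker script.
sample = 'distributed \u201cas is,\u201d without any warranty'

# cp850 (the codec named in the traceback) cannot encode U+201C.
try:
    sample.encode('cp850')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# One workaround: turn unencodable characters into escapes before printing.
safe_bytes = sample.encode('cp850', errors='backslashreplace')
safe_text = safe_bytes.decode('cp850')

print(encode_failed)
print(safe_text)
```

Typing the bare name fileText at the prompt works because the interactive prompt shows the repr of the string, which falls back to escapes for characters the console cannot display.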

/*--------------------------------------------------------------------------------------------------------------------*/
/* _ _ _ */
/* / -,- \ __ _ _ */
/* // | \\ / __\ | ___ ___| | __ _ _ */
/* | 0--,| / / | |/ _ \ / __| |/ / __ ___ _ _ __| |_ __ _ _ _| |_ ___ */
/* \\ // / /___| | (_) | (__| < / _/ _ \ ' \(_-< _/ _` | ' \ _(_-< */
/* \_-_-_/ \____/|_|\___/ \___|_|\_\ \__\___/_||_/__/\__\__,_|_||_\__/__/ */
/*--------------------------------------------------------------------------------------------------------------------*/

#include "clock_constants.h"
#include "../CMSIS/stm32f7xx.h"
#include "stm32f7xx_hal_rcc.h"

/*--------------------------------------------------------------------------------------------------*/
/*  S y s t e m C o r e C l o c k   i n i t i a l   v a l u e                                       */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* This variable is updated in three ways:                                                          */
/* 1) by calling CMSIS function SystemCoreClockUpdate()                                             */
/* 2) by calling HAL API function HAL_RCC_GetHCLKFreq()                                             */
/* 3) each time HAL_RCC_ClockConfig() is called to configure the system clock frequency             */
/* Note: If you use this function to configure the system clock; then there                         */
/* is no need to call the 2 first functions listed above, since SystemCoreClock                     */
/* variable is updated automatically.                                                               */
/*                                                                                                  */
uint32_t SystemCoreClock = 16000000;
const uint8_t AHBPrescTable[16] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 7, 8, 9};

/*--------------------------------------------------------------------------------------------------*/
/*  S y s t e m C o r e C l o c k   v a l u e   u p d a t e                                         */
/*--------------------------------------------------------------------------------------------------*/
/*                                                                                                  */
/* @brief Update SystemCoreClock variable according to Clock Register Values.                       */
/* The SystemCoreClock variable contains the core clock (HCLK), it can                              */
/* be used by the user application to setup the SysTick timer or configure                          */
/* other parameters.                                                                                */
/*--------------------------------------------------------------------------------------------------*/

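3 Answers

User

#1 · Edited 2024-05-23 15:47:24

To fully understand the answer, we need to look at the documentation.

Let's start with the open() function. According to the Python documentation: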

open() returns a file object, and is most commonly used with two arguments: open(filename, mode). [1]

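This means we are dealing with a file object, which can be a raw binary file, a buffered binary file or, as in this case, a text file [2]. But how does this text file object know which encoding to use? Well, according to the documentation: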

A file object able to read and write str objects. Often, a text file actually accesses a byte-oriented datastream and handles the text encoding automatically. [3]

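There we have it: the encoding is handled automatically. Because both formats fall within the supported codecs, Python knows how to encode and decode the file once it has the file object.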
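The three kinds of file object mentioned above can be seen directly by opening the same file in different modes. This is only an illustrative sketch: the file name sample.txt and the temporary directory are assumptions, not part of the question.

```python
import io
import os
import tempfile

# Create a small throwaway file (the name is arbitrary).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('hello')

# Text mode: a text layer that decodes bytes to str on top of a byte stream.
with open(path, 'r', encoding='utf-8') as text_file:
    text_type = type(text_file)
    inner_type = type(text_file.buffer)  # the buffered byte stream it wraps

# Binary mode: buffered bytes, no decoding at all.
with open(path, 'rb') as buffered_file:
    buffered_type = type(buffered_file)

# Binary mode with buffering disabled: the raw file.
with open(path, 'rb', buffering=0) as raw_file:
    raw_type = type(raw_file)

print(text_type.__name__, buffered_type.__name__, raw_type.__name__)
```

The text-mode object also exposes an encoding attribute, which is where the codec chosen "automatically" can be inspected.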

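User

#2 · Edited 2024-05-23 15:47:24

When no encoding is passed explicitly, open() uses the preferred system encoding for both reading and writing (I am not sure exactly how the preferred encoding is detected on Windows).

So, when you write: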

file = open(os.getcwd() + '\\myAnsi.txt', 'r')
file = open(os.getcwd() + '\\outputAnsi.txt', 'w')
file = open(os.getcwd() + '\\myUtf.txt', 'r')
file = open(os.getcwd() + '\\outputUtf.txt', 'w')

all four files are opened with the same encoding, both for reading and for writing.
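A quick way to check which encoding was picked is the encoding attribute of the file object. The sketch below uses a throwaway file in a temporary directory (an assumption for illustration); the value printed depends on the platform and Python configuration.

```python
import locale
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'default_encoding.txt')

# No encoding argument: Python falls back to a platform-dependent default.
with open(path, 'w') as writer:
    write_encoding = writer.encoding
with open(path, 'r') as reader:
    read_encoding = reader.encoding

# Both handles report the same codec, whatever it happens to be here.
print(write_encoding, read_encoding)

# The locale's preferred encoding, for comparison.
print(locale.getpreferredencoding(False))
```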

If you want to be sure the files are opened with a particular encoding, you have to pass encoding='cp1252' or encoding='utf-8' explicitly.

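As a minimal sketch of passing the encoding explicitly (not the answer's original snippet, which did not survive on this page): the same text is written once as cp1252 and once as UTF-8, then read back. The temporary directory and the sample text are assumptions; the file names only mirror those in the question.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
ansi_path = os.path.join(workdir, 'myAnsi.txt')  # stand-ins for the
utf_path = os.path.join(workdir, 'myUtf.txt')    # question's two files
text = 'na\u00efve caf\u00e9'  # 'naïve café': 10 characters, 2 of them non-ASCII

# Write the same string with two explicit encodings.
with open(ansi_path, 'w', encoding='cp1252') as f:
    f.write(text)
with open(utf_path, 'w', encoding='utf-8') as f:
    f.write(text)

# The bytes on disk differ: 1 byte per character vs. 2 bytes for each accented one.
with open(ansi_path, 'rb') as f:
    ansi_bytes = f.read()
with open(utf_path, 'rb') as f:
    utf_bytes = f.read()

# Reading each file back with its own encoding recovers the same string.
with open(ansi_path, 'r', encoding='cp1252') as f:
    ansi_text = f.read()
with open(utf_path, 'r', encoding='utf-8') as f:
    utf_text = f.read()

print(len(ansi_bytes), len(utf_bytes))
print(ansi_text == utf_text == text)
```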

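(By the way, I'm not a Windows expert, but I think you can simply write 'myAnsi.txt' instead of os.getcwd() + '\\myAnsi.txt'.)

Beyond that, keep in mind that some characters are represented the same way in different encodings. For example, the string hello has the same byte representation in ASCII, CP-1252, and UTF-8. In general, you need some non-ASCII characters before you see any difference: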

>>> 'hello'.encode('cp1252')
b'hello'
>>> 'hello'.encode('utf-8')
b'hello'  # different encoding, same byte representation
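For contrast, here is a small sketch with a non-ASCII character, the euro sign, where the two encodings produce different bytes and ASCII cannot encode it at all. The variable names are only illustrative.

```python
euro = '\u20ac'  # the euro sign

cp1252_bytes = euro.encode('cp1252')  # a single byte in CP-1252
utf8_bytes = euro.encode('utf-8')     # three bytes in UTF-8

# ASCII has no euro sign, so encoding fails.
try:
    euro.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(cp1252_bytes)  # b'\x80'
print(utf8_bytes)    # b'\xe2\x82\xac'
print(ascii_ok)      # False
```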

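Not only that, but some byte strings are perfectly valid in two different encodings, even though they mean different things. So when you decode a file with the wrong encoding, you may not get an error at all, just a strange string: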

>>> b'\xe2\x82\xac'.decode('utf-8')
'€'
>>> b'\xe2\x82\xac'.decode('cp1252')
'â‚¬'  # same byte representation, different string
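The silent case is not guaranteed in both directions. As a sketch: the CP-1252 byte for the euro sign is not valid UTF-8, so that wrong decode fails loudly, while the garbled string from the session above can be repaired by reversing the wrong decode.

```python
# b'\x80' is the euro sign in CP-1252, but a lone 0x80 is invalid UTF-8.
try:
    b'\x80'.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# UTF-8 bytes decoded as CP-1252 give mojibake without any error...
garbled = b'\xe2\x82\xac'.decode('cp1252')

# ...which can be undone by re-encoding with the wrong codec and
# decoding with the right one.
repaired = garbled.encode('cp1252').decode('utf-8')

print(decode_failed)
print(repaired == '\u20ac')
```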

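For the record, Python uses UTF-8, UTF-16 or UTF-32 to represent strings internally. More precisely, since Python 3.3 (PEP 393) CPython stores each string with a fixed width of 1, 2, or 4 bytes per code point (Latin-1, UCS-2, or UCS-4), which is roughly UTF-8 and UTF-16 used without continuation bytes. It picks the "shortest" width that fits every character in the string, so lookups are always O(1).

In short, you read both files using the system encoding and wrote both files using the same encoding, so no conversion took place. The content of the files you read is compatible with both CP-1252 and UTF-8.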
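The per-character width can be observed indirectly with sys.getsizeof. This is a rough sketch that relies on CPython implementation details (it is not guaranteed on other interpreters), and bytes_per_char is a helper made up for this illustration.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of one repeated character."""
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

ascii_width = bytes_per_char('a')             # fits in 1 byte
latin1_width = bytes_per_char('\u00e9')       # 'é' also fits in 1 byte
bmp_width = bytes_per_char('\u20ac')          # the euro sign needs 2 bytes
astral_width = bytes_per_char('\U0001F600')   # an emoji needs 4 bytes

print(ascii_width, latin1_width, bmp_width, astral_width)
```

On CPython 3.3 or later this is expected to print 1 1 2 4.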

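User

#3 · Edited 2024-05-23 15:47:24

CP-1252 is basically a byte-for-byte codec; it can decode almost arbitrary bytes, including those produced by a UTF-8 encoding. So, effectively, assuming you are on Windows with a Western locale, where the default encoding provided by open is cp-1252, if you never work with the string in Python and just read it and write it back, you might as well read and write in binary mode. You only see a problem when you use the string in a way that exposes it.

For example, consider the following test file, which contains a single UTF-8 encoded character: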
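If the content is only being copied, the binary-mode version looks roughly like this. It is a sketch with made-up file names in a temporary directory; no decoding happens, so the encoding of the file never matters.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'input.txt')    # hypothetical source file
dst = os.path.join(workdir, 'output.txt')   # hypothetical copy

# Prepare some bytes: UTF-8 encoded text with a Windows line ending.
original = '\u00e9'.encode('utf-8') + b'\r\n'
with open(src, 'wb') as f:
    f.write(original)

# 'rb' and 'wb' move bytes untouched: no codec and no newline translation.
with open(src, 'rb') as f:
    data = f.read()
with open(dst, 'wb') as f:
    f.write(data)

with open(dst, 'rb') as f:
    copied = f.read()

print(copied == original)
```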

with open('utf8file.txt', 'w', encoding='utf-8') as f:
    f.write('é')

The actual bytes in that file are C3 A9.

If you read that file as cp-1252, Python will happily do so, because each of those bytes is a legal cp-1252 byte.

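A minimal sketch of that misread (not the answer's original snippet, which did not survive on this page), using the same file name inside a temporary directory, which is an assumption for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'utf8file.txt')

# Write a single 'é' as UTF-8.
with open(path, 'w', encoding='utf-8') as f:
    f.write('\u00e9')

# The raw bytes on disk are C3 A9.
with open(path, 'rb') as f:
    raw = f.read()

# Reading as cp1252 raises no error, but yields two characters instead of one.
with open(path, 'r', encoding='cp1252') as f:
    misread = f.read()

# Writing the misread string back as cp1252 reproduces the original bytes,
# which is why a plain read-then-write never reveals the mistake.
written_back = misread.encode('cp1252')

print(raw)
print(len(misread))
print(written_back == raw)
```

The problem only shows up once the string is used, for example when checking its length or printing it.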

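Your problem is that most files are legal cp-1252 text files, and when they are, reading them and writing them back in the same encoding leaves the bytes unchanged. (Python's cp1252 codec rejects only the five unassigned bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D; latin-1, by contrast, accepts every byte and maps bytes such as \x8d to the Unicode code point with the same value.)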
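Which bytes each codec accepts can be checked by brute force over all 256 values. This is a sketch of that check; the variable names are only illustrative.

```python
all_bytes = [bytes([i]) for i in range(256)]

undefined_in_cp1252 = []
cp1252_lossless = True
for b in all_bytes:
    try:
        # Decode, then encode again: a legal byte should survive unchanged.
        if b.decode('cp1252').encode('cp1252') != b:
            cp1252_lossless = False
    except UnicodeDecodeError:
        undefined_in_cp1252.append(b)

# latin-1 maps every byte to the code point with the same value.
latin1_lossless = all(
    b.decode('latin-1').encode('latin-1') == b for b in all_bytes
)

print(undefined_in_cp1252)
print(cp1252_lossless, latin1_lossless)
```

So a file that contains one of the rejected bytes fails to decode as cp1252, while every other byte sequence round-trips unchanged.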
