Python, file(1): why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used to determine text vs. binary file

Posted 2024-03-28 21:02:30


Regarding the solution for determining whether a file is binary or text in Python, the answerer uses:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

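and then uses .translate(None, textchars) to delete (that is, replace with nothing) every such character from a file that was read in binary mode. Whatever bytes remain are the ones the check regards as non-text.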
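The check described above can be sketched as a pair of small functions. This is only an illustration, assuming Python 3, where bytes.translate accepts None as the table and a second argument listing the bytes to delete. The names is_binary_string and is_binary_file and the 1024-byte sample size are assumptions made for this example, not part of the quoted answer.

```python
# Byte values treated as "text": seven control characters plus 0x20-0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains any byte that is not in textchars."""
    # translate(None, textchars) deletes every byte listed in textchars;
    # anything left over is a byte the heuristic regards as non-text.
    return bool(data.translate(None, textchars))


def is_binary_file(path: str, sample_size: int = 1024) -> bool:
    """Apply the heuristic to the first sample_size bytes of a file.

    sample_size is an arbitrary choice for this sketch.
    """
    with open(path, "rb") as handle:
        return is_binary_string(handle.read(sample_size))
```

Reading only a sample keeps the check cheap on large files, at the cost of missing non-text bytes that appear later in the file.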

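The answerer also says that this choice of numbers is "based on file(1) behaviour" (what counts as text and what does not). What is the significance of these numbers for telling text files apart from binary files?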


1 answer
User
#1 · Posted 2024-03-28 21:02:30

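They represent the most common code points for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards such as Latin-1 or Windows code page 1251 use the remaining 128 byte values for accented characters and so on.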

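You would expect text to use only those code points. Binary data would use all code points in the range 0x00-0xFF; a text file, for example, would probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).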
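To make that expectation concrete, here is a small sketch that lists which byte values in a sample fall outside the allowed set. The helper name offending_bytes and both sample byte strings are made up for this illustration.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def offending_bytes(data: bytes) -> list:
    """Return the sorted, distinct byte values in data that are not in textchars."""
    allowed = set(textchars)
    return sorted({value for value in data if value not in allowed})


# Latin-1 encoded text: accented letters land in 0x80-0xFF, which is allowed.
text_sample = "naïve café\r\n".encode("latin-1")

# A made-up binary-looking sample containing NUL and 0x1A bytes.
binary_sample = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(offending_bytes(text_sample))    # []
print(offending_bytes(binary_sample))  # [0, 26]
```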

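This is a heuristic at best, however. Some text files may still use C0 control codes other than the 7 explicitly named characters, and I am sure there is binary data out there that happens not to contain any of the 25 byte values missing from the textchars string.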
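Both failure modes are easy to reproduce. The sketch below first counts the byte values missing from textchars, then builds one text-like sample that is flagged as binary and one packed-integer sample that is not. The helper name looks_binary and both samples are invented for this example.

```python
import struct

textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Byte values the heuristic treats as non-text.
excluded = [value for value in range(256) if value not in textchars]
print(len(excluded))  # 25


def looks_binary(data: bytes) -> bool:
    """Return True if data contains any byte outside textchars."""
    return bool(data.translate(None, textchars))


# False positive: text that uses vertical tab (0x0B), a C0 control code
# that is not among the seven allowed ones.
text_with_vertical_tab = b"column one\x0bcolumn two\n"
print(looks_binary(text_with_vertical_tab))  # True

# False negative: four packed 16-bit integers whose bytes all happen to
# fall in 0x20-0xFF, so nothing is left after translate().
packed_numbers = struct.pack("<4H", 0x2020, 0x4142, 0xFFFE, 0x7F7F)
print(looks_binary(packed_numbers))  # False
```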

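The author of the range probably based it on a table in the source of the file command. That table marks each byte as non-text, ASCII, Latin-1, or non-ISO extended ASCII, and includes documentation on why those code points were chosen: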

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly, that table excludes 0x7F, whereas the code you found does not.
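If you wanted the Python check to treat 0x7F (DEL) as non-text as well, one possible variant is sketched below. The name strict_textchars and the allowed parameter are hypothetical, and this only removes 0x7F; it does not claim to reproduce the rest of the table in file.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Hypothetical stricter variant: the same set without DEL (0x7F).
strict_textchars = bytes(value for value in textchars if value != 0x7F)


def is_binary_string(data: bytes, allowed=textchars) -> bool:
    """Return True if data contains any byte that is not in allowed."""
    return bool(data.translate(None, allowed))


sample = b"abc\x7fdef"
print(is_binary_string(sample))                    # False: 0x7F is in textchars
print(is_binary_string(sample, strict_textchars))  # True: 0x7F was removed
```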
