How do I determine whether a character is valid in HTML?

Posted 2024-03-28 05:04:07


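Some characters, such as those with ordinal 22 or 8, are not displayed in HTML (using Chrome, for example when copying and pasting them into this "Ask Question" editor; I am assuming UTF-8). How do I determine which characters are valid HTML and, of the valid ones, which are rendered?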

A table or reference would be helpful (I could not find one by googling), but preferably I need a set of rules or a solution that can be implemented in Python.


2 Answers

As answered in Blender's comment, from Wikipedia:

HTML forbids[8] the use of the characters with Universal Character Set/Unicode code points

  • 0 to 31, except 9, 10, and 13 (C0 control characters)
  • 127 (DEL character)
  • 128 to 159 (x80 – x9F, C1 control characters)
  • 55296 to 57343 (xD800 – xDFFF, the UTF-16 surrogate halves)

The Unicode standard also forbids:

  • 65534 and 65535 (xFFFE – xFFFF), non-characters, related to xFEFF, the byte order mark.

These characters are not even allowed by reference. That is, you should not even write them as numeric character references. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to bytes 128–159 (decimal) in the Windows-1252 character encoding. This is in violation of HTML and SGML standards, and the characters are already assigned to higher code points, so HTML document authors should always use the higher code points. For example, for the trademark sign (™), use &#8482;, not &#153;.

The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents, but, along with 32 (space) are all considered "whitespace".[9] The "form feed" control character, which would be at 12, is not allowed in HTML documents, but is also mentioned as being one of the "white space" characters – perhaps an oversight in the specifications. In HTML, most consecutive occurrences of white space characters, except in a <pre> block, are interpreted as comprising a single "word separator" for rendering purposes. A word separator is typically rendered a single en-width space in European languages, but not in all the others.
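
Since the question asks for something implementable in Python, the ranges quoted above can be turned into a small check. This is only a sketch under the assumption that the Wikipedia list is the rule set you want; the function names `is_forbidden_in_html` and `find_forbidden` are made up for illustration, and the check says nothing about whether a browser or font will actually render an allowed character.

```python
def is_forbidden_in_html(ch: str) -> bool:
    """Return True if the single character `ch` falls in one of the
    code point ranges listed in the Wikipedia excerpt above."""
    cp = ord(ch)
    if cp < 32 and cp not in (9, 10, 13):   # C0 controls except tab, LF, CR
        return True
    if cp == 127:                           # DEL
        return True
    if 128 <= cp <= 159:                    # C1 controls
        return True
    if 0xD800 <= cp <= 0xDFFF:              # UTF-16 surrogate halves
        return True
    if cp in (0xFFFE, 0xFFFF):              # non-characters
        return True
    return False


def find_forbidden(text: str) -> list:
    """Return (index, code point) pairs for every forbidden character."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_forbidden_in_html(ch)]
```

For the two characters from the question, `is_forbidden_in_html(chr(22))` and `is_forbidden_in_html(chr(8))` both return True, while a tab or an ordinary letter returns False.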

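What counts as a valid character in HTML depends on how you define "HTML" and "valid". Different HTML versions have different rules for which characters are formally valid, and those rules may allow characters that are valid but not recommended. There are also general policies, such as favoring Normalization Form C; although these policies are not part of the HTML specifications, they are often regarded as relevant to HTML.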
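
If the Normalization Form C policy matters for your use case, Python's standard `unicodedata` module can test for it. The helper name `is_nfc` below is hypothetical, and this is a minimal sketch rather than a validity check: text that is not in NFC is still formally allowed in HTML.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if `text` is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text
```

For example, `is_nfc("e\u0301")` (the letter e followed by a combining acute accent) returns False, whereas the precomposed `"\u00e9"` returns True.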

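What gets rendered (and how) depends on the browser, the style sheets of the HTML document, and the fonts available on the user's computer. Moreover, not every character is rendered as such. For example, in normal HTML content, any contiguous sequence of whitespace characters is treated as equivalent to a single space character.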
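
The whitespace behavior can be approximated in Python with a regular expression. This is a rough sketch only: the name `collapse_whitespace` is made up, the character set (space, tab, line feed, carriage return, form feed) is an assumption about what counts as whitespace, and real rendering also depends on things this ignores, such as preformatted blocks and the style sheet.

```python
import re

# Assumed whitespace set: space, tab, line feed, carriage return, form feed.
_HTML_WHITESPACE_RUN = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace characters with one space."""
    return _HTML_WHITESPACE_RUN.sub(" ", text)
```

With this helper, `collapse_whitespace("a \t\n b")` gives `"a b"`, which mirrors how the two words would be separated by a single space in ordinary HTML content.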

So the answer is really "it depends". Consider asking a more focused, practical question to get a more focused answer.
