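Unicode category database
Detailed description of the unicategories Python project
unicategories
Unicode category database, generated at setup time.
This module exposes a categories dictionary containing RangeGroup instances.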
Example

    from unicategories import categories

    upperchars = categories['Lu'].characters()  # iterator
    print('Unicode uppercase characters are "%s"' % ''.join(upperchars))
    # Unicode uppercase characters are "ABCDEF..."

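RangeGroup
Immutable iterable (based on tuple, with some helpful methods) of (start, end) tuples. Each tuple works like Python's range: the start is included and the end is excluded.
This approach was chosen for storage efficiency, since storing every character individually would take a lot of memory.
The RangeGroup class provides the following methods: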
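
To illustrate why half-open ranges are compact, here is a minimal sketch that collapses code points into (start, end) tuples. It relies only on the standard-library `unicodedata` module. The helper name `to_ranges` is hypothetical, and this is not necessarily how unicategories builds its database at setup time.

```python
import unicodedata


def to_ranges(codes):
    """Collapse sorted code points into half-open (start, end) tuples."""
    ranges = []
    for code in codes:
        if ranges and ranges[-1][1] == code:
            # The code point directly follows the last range: extend it.
            ranges[-1] = (ranges[-1][0], code + 1)
        else:
            # Gap found: open a new range ending just past this code point.
            ranges.append((code, code + 1))
    return tuple(ranges)


# Code points below U+0100 whose general category is 'Lu'.
upper_codes = [c for c in range(0x100) if unicodedata.category(chr(c)) == 'Lu']
upper_ranges = to_ranges(upper_codes)

print(len(upper_codes), 'code points stored as', len(upper_ranges), 'ranges')
print(upper_ranges[0])  # (65, 91), i.e. 'A' through 'Z'
```

Each run of consecutive code points costs two integers regardless of its length, which is where the memory saving comes from.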
range_group.characters()
Get iterator with all characters on this range group.
:yields: iterator of characters (str of size 1)
:ytype: str
range_group.codes()
Get iterator for all unicode code points contained in this range group.
:yields: iterator of character index (int)
:ytype: int
range_group.has(character)
Check whether a character (or character code point) is contained by any range on this range group.
:param character: character or unicode code point to look for
:type character: str or int
:returns: True if character is contained by any range, False otherwise
:rtype: bool
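
The three methods above can be sketched with a small tuple subclass. `SimpleRangeGroup` is a hypothetical stand-in written for illustration, assuming half-open (start, end) pairs as described earlier. It is not the library's actual implementation.

```python
class SimpleRangeGroup(tuple):
    """Hypothetical stand-in: a tuple of half-open (start, end) pairs."""

    def codes(self):
        # Yield every code point covered by each range, end excluded.
        for start, end in self:
            yield from range(start, end)

    def characters(self):
        # Lazily convert code points to one-character strings.
        return map(chr, self.codes())

    def has(self, character):
        # Accept either a one-character str or an int code point.
        code = ord(character) if isinstance(character, str) else character
        return any(start <= code < end for start, end in self)


group = SimpleRangeGroup([(65, 68), (97, 99)])
print(''.join(group.characters()))    # ABCab
print(list(group.codes()))            # [65, 66, 67, 97, 98]
print(group.has('B'), group.has(68))  # True False
```

Because membership is tested range by range, `has` never needs to materialize the individual characters.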
Unicode categories
Taken from Wikipedia.
| Value | Category (major, minor) | Basic type | Character assigned | Fixed | Remarks |
|---|---|---|---|---|---|
| Lu | Letter, uppercase | Graphic | Character | | |
| Ll | Letter, lowercase | Graphic | Character | | |
| Lt | Letter, titlecase | Graphic | Character | | Ligatures containing uppercase followed by lowercase letters |
| Lm | Letter, modifier | Graphic | Character | | |
| Lo | Letter, other | Graphic | Character | | |
| Mn | Mark, nonspacing | Graphic | Character | | |
| Mc | Mark, spacing combining | Graphic | Character | | |
| Me | Mark, enclosing | Graphic | Character | | |
| Nd | Number, decimal digit | Graphic | Character | | All these, and only these, have Numeric Type = De |
| Nl | Number, letter | Graphic | Character | | Numerals composed of letters or letterlike symbols (e.g., Roman numerals) |
| No | Number, other | Graphic | Character | | E.g., vulgar fractions, superscript and subscript digits |
| Pc | Punctuation, connector | Graphic | Character | | Includes "_" underscore |
| Pd | Punctuation, dash | Graphic | Character | | Includes several hyphen characters |
| Ps | Punctuation, open | Graphic | Character | | Opening bracket characters |
| Pe | Punctuation, close | Graphic | Character | | Closing bracket characters |
| Pi | Punctuation, initial quote | Graphic | Character | | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
| Pf | Punctuation, final quote | Graphic | Character | | Closing quotation mark. May behave like Ps or Pe depending on usage |
| Po | Punctuation, other | Graphic | Character | | |
| Sm | Symbol, math | Graphic | Character | | |
| Sc | Symbol, currency | Graphic | Character | | |
| Sk | Symbol, modifier | Graphic | Character | | |
| So | Symbol, other | Graphic | Character | | |
| Zs | Separator, space | Graphic | Character | | Includes the space, but not TAB, CR, or LF, which are Cc |
| Zl | Separator, line | Format | Character | | Only U+2028 LINE SEPARATOR (LSEP) |
| Zp | Separator, paragraph | Format | Character | | Only U+2029 PARAGRAPH SEPARATOR (PSEP) |
| Cc | Other, control | Control | Character | Fixed 65 | No name |
| Cf | Other, format | Format | Character | | Includes the soft hyphen, control characters to support bi-directional text, and language tag characters |
| Cs | Other, surrogate | Surrogate | Not (but abstract) | Fixed 2,048 | No name |
| Co | Other, private use | Private-use | Not (but abstract) | Fixed 137,468 total: 6,400 in BMP, 131,068 in Planes 15–16 | No name |
| Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name |
| Cn | Other, not assigned | Reserved | Not | Not fixed | No name |
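
Several remarks in the table can be checked with Python's standard-library `unicodedata` module, independently of unicategories. The sketch below is only a spot check. The helper name `count_category` is hypothetical and is not part of either library.

```python
import unicodedata

# Characters named in the Remarks column, with the category claimed for them.
expected = [
    ('_', 'Pc'),       # underscore
    (' ', 'Zs'),       # space
    ('\t', 'Cc'),      # TAB is a control character, not a separator
    ('\u2028', 'Zl'),  # LINE SEPARATOR
    ('\u2029', 'Zp'),  # PARAGRAPH SEPARATOR
    ('\u00ad', 'Cf'),  # soft hyphen
]
for char, category in expected:
    print('U+%04X' % ord(char), unicodedata.category(char), category)


def count_category(category):
    """Count code points in the whole Unicode range with the given category."""
    return sum(
        1 for code in range(0x110000)
        if unicodedata.category(chr(code)) == category
    )


print(count_category('Cc'))  # 65
print(count_category('Cs'))  # 2048
```

The two counts match the fixed totals listed for Cc and Cs.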
In addition, unicategories provides the general categories L, M, N, P, S, Z and C.
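
A general category corresponds to the first letter of the two-letter values in the table. The sketch below uses the standard-library `unicodedata` module to group the minor categories under their major letter. It assumes, without confirmation from the source, that each general category in unicategories covers the union of its minor categories. The names `major_category` and `minor_by_major` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def major_category(character):
    """Return the major (one-letter) category of a character."""
    return unicodedata.category(character)[0]


# Group every two-letter category under its first letter.
minor_by_major = defaultdict(set)
for code in range(0x110000):
    minor = unicodedata.category(chr(code))
    minor_by_major[minor[0]].add(minor)

print(sorted(minor_by_major))       # ['C', 'L', 'M', 'N', 'P', 'S', 'Z']
print(sorted(minor_by_major['L']))  # ['Ll', 'Lm', 'Lo', 'Lt', 'Lu']
print(major_category('A'), major_category('1'))  # L N
```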