Getting characters from Asian text in Python
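Hi everyone, I have a problem.
I have the word "बन्दूक". The character count in Notepad shows 3 characters, but the code "_charaters = list(line)" reports 6 characters.
How can I get only these 3 characters?
For example: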
- ब
- न्दू
- क
2 Answers
0
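Another approach is to use pyicu for the segmentation. It relies on a tool called a break iterator: ICU4C provides character, word, and sentence break iterators for many locales.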
import icu


def get_boundaries(loc, s):
    # Character (grapheme cluster) break iterator for the given locale.
    bi = icu.BreakIterator.createCharacterInstance(loc)
    bi.setText(s)
    # Iterating yields each boundary after the start, so prepend 0.
    boundaries = [*bi]
    boundaries.insert(0, 0)
    return boundaries


def get_graphemes(loc, text):
    boundary_indices = get_boundaries(loc, text)
    return [text[boundary_indices[i]:boundary_indices[i+1]] for i in range(len(boundary_indices)-1)]


print(get_graphemes(icu.Locale('hi'), "बन्दूक"))
# ['ब', 'न्दू', 'क']
1
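Perhaps you are looking for the pyuegc module:
This module implements an algorithm that breaks text strings (that is, sequences of characters) into extended grapheme clusters ("user-perceived characters"), as specified in UAX #29, "Unicode Text Segmentation".
Example (partially commented; the string "बन्दूक" is hard-coded):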
from pyuegc import EGC


def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""


unistr = "बन्दूक"
egcs = EGC(unistr)
print(_output(unistr, egcs))

# above code basically copied from https://pypi.org/project/pyuegc/
# below code for deeper insight into the EGC results
import json
print('\n' + json.dumps(unistr))
print(json.dumps(egcs) + '\n')

import unicodedata
for egc in egcs:
    print(f'\nEGC {egc} {json.dumps(egc)}')
    for uch in egc:
        print(f'char {uch} {json.dumps(uch)} {unicodedata.name(uch, "???")}')
Result: .\SO\78102711.py
# String: बन्दूक
# Length of string: 6
# EGC: ['ब', 'न्दू', 'क']
# Length of EGC: 3
"\u092c\u0928\u094d\u0926\u0942\u0915"
["\u092c", "\u0928\u094d\u0926\u0942", "\u0915"]
EGC ब "\u092c"
char ब "\u092c" DEVANAGARI LETTER BA
EGC न्दू "\u0928\u094d\u0926\u0942"
char न "\u0928" DEVANAGARI LETTER NA
char ् "\u094d" DEVANAGARI SIGN VIRAMA
char द "\u0926" DEVANAGARI LETTER DA
char ू "\u0942" DEVANAGARI VOWEL SIGN UU
EGC क "\u0915"
char क "\u0915" DEVANAGARI LETTER KA
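The breakdown above also shows why list(line) reports 6: it splits on code points, while the virama and the vowel sign belong to the cluster of the preceding consonant. As a rough illustration only, here is a standard-library sketch that reproduces the grouping for this word. The function name rough_clusters and its two rules (attach combining marks to the previous cluster, and let a trailing virama pull in the next character) are my own simplification, not the UAX #29 algorithm and not what pyuegc or ICU do internally, so it should not be relied on for general text.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def rough_clusters(text):
    """Hypothetical helper: approximate grapheme clusters for simple Devanagari.

    Simplified assumption, not full UAX #29: a character joins the previous
    cluster if it is a combining mark (Mn, Mc, Me) or if the previous
    cluster ends with a virama.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u092c\u0928\u094d\u0926\u0942\u0915"  # "बन्दूक"
print(len(list(word)))          # 6 code points
print(len(rough_clusters(word)))  # 3 clusters
print(rough_clusters(word))
```

For anything beyond a quick check, one of the libraries from the answers is the safer choice, since they follow the full segmentation rules.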