Homoglyphs
Detailed description of the homoglyphs Python project
homoglyphs: a Python library for getting homoglyphs and converting them to ASCII.
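Features
It is a smarter version of confusable_homoglyphs:
- Automatic or manual category selection (aliases from ISO 15924).
- Automatic or manual loading of only the alphabets that are needed into memory.
- Conversion to ASCII.
- More configurable.
- More stable.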
Installation
sudo pip install homoglyphs
Usage
The best way to explain something is to show how it works, so let's look at some real usage.
Importing:
import homoglyphs as hg
Languages
# detect
hg.Languages.detect('w')
# {'pl', 'da', 'nl', 'fi', 'cz', 'sr', 'pt', 'it', 'en', 'es', 'sk', 'de', 'fr', 'ro'}
hg.Languages.detect('т')
# {'mk', 'ru', 'be', 'bg', 'sr'}
hg.Languages.detect('.')
# set()

# get alphabet for languages
hg.Languages.get_alphabet(['ru'])
# {'в', 'Ё', 'К', 'Т', ..., 'Р', 'З', 'Э'}

# get all languages
hg.Languages.get_all()
# {'nl', 'lt', ..., 'de', 'mk'}
Categories
Categories are aliases from ISO 15924.
# detect
hg.Categories.detect('w')
# 'LATIN'
hg.Categories.detect('т')
# 'CYRILLIC'
hg.Categories.detect('.')
# 'COMMON'

# get alphabet for categories
hg.Categories.get_alphabet(['CYRILLIC'])
# {'ӗ', 'Ԍ', 'Ґ', 'Я', ..., 'Э', 'ԕ', 'ӻ'}

# get all categories
hg.Categories.get_all()
# {'RUNIC', 'DESERET', ..., 'SOGDIAN', 'TAI_LE'}
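To make the idea of a category more concrete, here is a rough standard-library sketch that guesses a script from a character's Unicode name. It is only an approximation and is not how the library works: Python's `unicodedata` module does not expose the Unicode script property, so the helper `guess_category` and the short `KNOWN_SCRIPTS` set are assumptions made for illustration.

```python
import unicodedata

# Hypothetical, deliberately short list of script names to recognise.
KNOWN_SCRIPTS = {'LATIN', 'CYRILLIC', 'GREEK', 'ARMENIAN', 'HEBREW', 'ARABIC'}


def guess_category(char):
    """Guess a script name from the first word of the character's Unicode name."""
    first_word = unicodedata.name(char, '').split(' ')[0]
    # Anything we do not recognise (punctuation, digits, ...) is treated as COMMON.
    return first_word if first_word in KNOWN_SCRIPTS else 'COMMON'


print(guess_category('w'))  # LATIN
print(guess_category('т'))  # CYRILLIC
print(guess_category('.'))  # COMMON
```

For these three inputs the guess agrees with the `hg.Categories.detect` results listed above, but a name-based heuristic will disagree with real script data for many other characters.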
Homoglyphs
Get homoglyphs:
# get homoglyphs (latin alphabet initialized by default)
hg.Homoglyphs().get_combinations('q')
# ['q', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?']
Alphabet loading:
# load alphabet on init by categories
homoglyphs = hg.Homoglyphs(categories=('LATIN', 'COMMON', 'CYRILLIC'))  # alphabet loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы', 'ꭇы', 'ꭈы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы', '?ы']

# load alphabet on init by languages
homoglyphs = hg.Homoglyphs(languages={'ru', 'en'})  # alphabet will be loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы']

# set the alphabet manually on init ('abc' is Latin, 'абс' is Cyrillic)
homoglyphs = hg.Homoglyphs(alphabet='abc абс')
homoglyphs.get_combinations('с')
# ['c', 'с']

# load alphabet on demand
# the alphabet for the "en" language is loaded on init
homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)
# the alphabet for the "ru" language is loaded by this call
homoglyphs.get_combinations('гы')
# ['rы', 'гы']
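You can combine categories, languages, alphabet, and any strategy as you like. The strategy specifies how to handle a character that has not been loaded yet:
- STRATEGY_LOAD: load the category for this character
- STRATEGY_IGNORE: add the character to the result unchanged
- STRATEGY_REMOVE: remove the character from the result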
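The three strategies can be illustrated with a small self-contained toy model. This is a sketch under stated assumptions, not the library's implementation: `toy_combinations`, the `ALPHABETS` and `LOOKALIKES` tables, and the constant values are all hypothetical and far smaller than the real data.

```python
from itertools import product

# Illustrative constants; the values are not taken from the library.
STRATEGY_LOAD, STRATEGY_IGNORE, STRATEGY_REMOVE = 'load', 'ignore', 'remove'

# Hypothetical alphabets keyed by category.
ALPHABETS = {
    'LATIN': set('abcdefghijklmnopqrstuvwxyz'),
    'CYRILLIC': set('абвгдежзийклмнопрстуфхцчшщъыьэюя'),
}
# Hypothetical groups of characters that look alike.
LOOKALIKES = [{'r', 'г'}, {'c', 'с'}, {'o', 'о'}]


def toy_combinations(text, categories, strategy):
    """Return every spelling of text built from look-alikes in the loaded alphabet."""
    alphabet = set().union(*(ALPHABETS[c] for c in categories))
    per_char = []
    for char in text:
        if char not in alphabet:
            if strategy == STRATEGY_LOAD:
                # Load every known alphabet that contains the character.
                for chars in ALPHABETS.values():
                    if char in chars:
                        alphabet |= chars
            elif strategy == STRATEGY_IGNORE:
                # Keep the character as-is, without looking for variants.
                per_char.append([char])
                continue
            elif strategy == STRATEGY_REMOVE:
                # Drop the character from every result.
                continue
        group = next((g for g in LOOKALIKES if char in g), {char})
        # Only offer variants that belong to the loaded alphabet (plus the char itself).
        per_char.append(sorted((group | {char}) & (alphabet | {char})))
    return [''.join(combo) for combo in product(*per_char)]


print(toy_combinations('гы', ['LATIN'], STRATEGY_LOAD))    # ['rы', 'гы']
print(toy_combinations('гы', ['LATIN'], STRATEGY_IGNORE))  # ['гы']
print(toy_combinations('cы', ['LATIN'], STRATEGY_REMOVE))  # ['c']
```

With only the Latin alphabet loaded, the load strategy pulls in Cyrillic when it meets 'г' and so finds the Latin look-alike 'r'; the ignore strategy passes the Cyrillic characters through untouched; the remove strategy drops them.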
Converting glyphs to ASCII characters
homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)

# convert
homoglyphs.to_ascii('тест')
# ['tect']
homoglyphs.to_ascii('ХР123.')  # these are Cyrillic "Х" and "Р"
# ['XP123.', 'XPI23.', 'XPl23.']

# by default, a string containing characters that cannot be converted is skipped
homoglyphs.to_ascii('лол')
# []

# you can set a strategy that removes unconverted non-ASCII characters from the result
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
)
homoglyphs.to_ascii('лол')
# ['o']

# you can also set the range of character codes allowed as ASCII (0-128 by default):
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
    ascii_range=range(ord('a'), ord('z')),
)
homoglyphs.to_ascii('ХР123.')
# ['l']
homoglyphs.to_ascii('хр123.')
# ['xpl']
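The conversion rules above can be mimicked with a short toy function. This is a minimal sketch, assuming a tiny hand-written table; `toy_to_ascii`, `TO_ASCII`, and the `remove_unconvertible` flag are hypothetical names, not part of the library.

```python
from itertools import product

# Hypothetical table of ASCII look-alikes for a few Cyrillic letters.
TO_ASCII = {'т': ['t'], 'е': ['e'], 'с': ['c'], 'о': ['o'], 'Х': ['X'], 'Р': ['P']}


def toy_to_ascii(text, remove_unconvertible=False, ascii_range=range(128)):
    """Return every spelling of text that uses only code points in ascii_range."""
    per_char = []
    for char in text:
        candidates = [char] + TO_ASCII.get(char, [])
        allowed = [c for c in candidates if ord(c) in ascii_range]
        if allowed:
            per_char.append(allowed)
        elif remove_unconvertible:
            continue   # drop the character, keep converting the rest
        else:
            return []  # one unconvertible character rejects the whole string
    return [''.join(combo) for combo in product(*per_char)]


print(toy_to_ascii('тест'))                            # ['tect']
print(toy_to_ascii('лол'))                             # []
print(toy_to_ascii('лол', remove_unconvertible=True))  # ['o']
```

Note that a range written as `range(ord('a'), ord('z'))` stops before 'z', so 'z' itself falls outside it; use `ord('z') + 1` as the upper bound if 'z' should be allowed.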