DirtyText
Search for [AB]use of Unicode glyphs.
Installation
The dirtytext package can be installed with pip :snake: ::
$ pip install dirtytext
or downloaded from GitHub.
Quick tour:
Common options:
- Read from a file: -f <filename>
- Save the modified text: -s <file>
- Filter the text: --filter
- Pipe mode: -p
:mag_right: Find zero-width characters:
$> echo "This text contains zero-width chars" | dirtytext --zero -v
This produces the following output:
Contains zero-width characters: True JSON: [{"idx": 0, "char": "\ufeff", "cval": "FEFF", "infos": null}, {"idx": 10, "char": "\u200c", "cval": "200C", "infos": null}, {"idx": 11, "char": "\u200c", "cval": "200C", "infos": null}, ...]
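To make the shape of these records concrete, here is a minimal Python sketch of a zero-width scan. It is not DirtyText's implementation: the `ZERO_WIDTH` set and the `find_zero_width` helper are hypothetical names chosen for illustration, and the set of code points is a small hand-picked assumption. The sample string is built so that the hits land at the same indices as in the output above.

```python
import json

# Assumption: a small, hand-picked set of zero-width code points.
# The list DirtyText actually uses may differ.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def find_zero_width(text):
    """Return one record per zero-width character found in text."""
    return [
        {"idx": i, "char": ch, "cval": f"{ord(ch):04X}", "infos": None}
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]


# Hypothetical sample: a BOM at index 0 and two non-joiners after "This text".
sample = "\ufeffThis text\u200c\u200c contains zero-width chars"
hits = find_zero_width(sample)
print("Contains zero-width characters:", bool(hits))
print(json.dumps(hits))
```

Each record keeps the position (`idx`), the character itself, and its code point in hexadecimal (`cval`), mirroring the field names in the tool's JSON.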
:mag_right: Find confusable characters:
$> echo "hello" | dirtytext --confusables greek -v
This produces the following output:
Contains confusables characters: True JSON: [{"idx": 2, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, {"idx": 3, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, {"idx": 4, "char": "o", "cval": "006F", "infos": [{"target": "03BF", "description": "GREEK SMALL LETTER OMICRON"}, {"target": "03C3", "description": "GREEK SMALL LETTER SIGMA"}]}]
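The lookup behind such a report can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DirtyText's code: `GREEK_LOOKALIKES` is a hypothetical hand-written table that contains only the pairs visible in the output above, and `find_confusables` is a made-up helper name. The descriptions come from the standard `unicodedata` module.

```python
import unicodedata

# Assumption: a tiny hand-written table covering only the pairs that appear
# in the output above. A real table would be built from the Unicode
# Consortium's confusables data.
GREEK_LOOKALIKES = {
    "l": ["\u0399"],            # GREEK CAPITAL LETTER IOTA
    "o": ["\u03bf", "\u03c3"],  # GREEK SMALL LETTER OMICRON / SIGMA
}


def find_confusables(text, table=GREEK_LOOKALIKES):
    """Report every character of text that has look-alikes in table."""
    records = []
    for i, ch in enumerate(text):
        targets = table.get(ch)
        if not targets:
            continue
        records.append({
            "idx": i,
            "char": ch,
            "cval": f"{ord(ch):04X}",
            "infos": [
                {"target": f"{ord(t):04X}", "description": unicodedata.name(t)}
                for t in targets
            ],
        })
    return records


for record in find_confusables("hello"):
    print(record)
```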
:mag_right: Find and filter anomalies in Latin text:
example.txt: It ⅽan be argueⅾ that the ⅽomputer ⅰs humanⅰty’s attempt to repⅼⅰⅽate the human brain. This ⅰs perhaps an unattainable goal. However, unattainable goals often lead to outstanding accomplishment.
$> dirtytext -f example.txt --lsubs --filter -s out.txt
out.txt: It can be argued that the computer is humanity’s attempt to replicate the human brain. This is perhaps an unattainable goal. However, unattainable goals often lead to outstanding accomplishment.
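For comparison, a similar clean-up can be approximated with the Python standard library alone. The sketch below swaps in a different technique, Unicode NFKC normalization, and makes no claim about how `--lsubs --filter` works internally; `fold_compat` is a hypothetical helper name. It assumes the odd letters in example.txt are small Roman numerals (U+2170, U+217C, U+217D, U+217E), which NFKC maps to the plain Latin letters i, l, c and d.

```python
import unicodedata


def fold_compat(text):
    """Replace compatibility characters with their plain equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# The start of example.txt, written with escapes so the substituted
# characters (assumed to be small Roman numerals) are visible.
dirty = (
    "It \u217dan be argue\u217e that the \u217domputer \u2170s "
    "human\u2170ty\u2019s attempt to rep\u217c\u2170\u217date the human brain."
)
print(fold_compat(dirty))
```

NFKC is a blunt instrument: it also rewrites ligatures, full-width forms and superscripts, and it does not touch look-alikes from other scripts such as Greek or Cyrillic, so it is not a substitute for a dedicated confusables check.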
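Unicode database
The Unicode data that make up the DirtyText database are extracted from files published by the Unicode Consortium. In particular, the dirtytext/data directory contains two database files:
If dirtytext/data does not exist, DirtyText downloads and builds the database before performing the requested operation. After that, you can force a database update by adding the --update option.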
License
Released under GPL-3.0.