Natural Language Toolkit for Indic Languages (iNLTK)
Detailed description of the inltk Python project
iNLTK aims to provide out-of-the-box support for the various NLP tasks that an application developer may need for Indic languages.
Installation
pip install http://download.pytorch.org/whl/cpu/torch-1.0.0-cp36-cp36m-linux_x86_64.whl
pip install inltk
iNLTK runs on CPU, as do most deep learning models in production. The first command above installs the CPU-only build of PyTorch which, as the name suggests, does not have CUDA support.
Note: inltk is currently supported only on Linux with Python >= 3.6
Installing on Windows (experimental)
pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp36-cp36m-win_amd64.whl
pip install inltk
Supported languages
Language | Code |
---|---|
Hindi | hi |
Punjabi | pa |
Sanskrit | sa |
Gujarati | gu |
Kannada | kn |
Malayalam | ml |
Nepali | ne |
Odia | or |
Marathi | mr |
Bengali | bn |
Tamil | ta |
Urdu | ur |
Usage
Setup the language
from inltk.inltk import setup

setup('<code-of-language>')  # if you want to use Hindi, then setup('hi')
Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.
Tokenize
from inltk.inltk import tokenize

tokenize(text, '<code-of-language>')  # where text is a string in <code-of-language>
Get embedding vectors
This returns an array of embedding vectors, one for each token in the text.
from inltk.inltk import get_embedding_vectors
vectors = get_embedding_vectors(text, '<code-of-language>')  # where text is a string in <code-of-language>
Example:
>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)
>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ..., 0.859898, 1.940608, 0.09252 , 1.043363], dtype=float32), array([ 0.290839, 1.459981, -0.582347, 0.27822 , ..., -0.736542, -0.259388, 0.086048, 0.736173], dtype=float32), array([ 0.069481, -0.069362, 0.17558 , -0.349333, ..., 0.390819, 0.117293, -0.194081, 2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131, 0.161678, ..., 0.048844, -1.090546, 0.154555, 0.925028], dtype=float32), array([ 0.219287, 0.759776, 0.695487, 1.097593, ..., 0.016115, -0.81602 , 0.333799, 1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479, 0.177357, ..., 0.729619, -0.161499, -0.270225, 2.083801], dtype=float32), array([-0.501414, 1.337661, -0.405563, 0.733806, ..., -0.182045, -1.413752, 0.163339, 0.907111], dtype=float32), array([ 0.185258, -0.429729, 0.060273, 0.232177, ..., -0.537831, -0.51664 , -0.249798, 1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8
To get a feel for the embeddings, check out this visualization of a subset of the Hindi embedding vectors.
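Once you have the per-token vectors, a common next step is to compare them. This is a minimal sketch (not an iNLTK API) of cosine similarity over 400-dimensional vectors; dummy random vectors stand in for real `get_embedding_vectors` output.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
v1 = [rng.gauss(0, 1) for _ in range(400)]  # stands in for vectors[0] of one word
v2 = [rng.gauss(0, 1) for _ in range(400)]  # stands in for vectors[0] of another word

print(cosine_similarity(v1, v2))  # some value in [-1, 1]
print(cosine_similarity(v1, v1))  # ~1.0: a vector is maximally similar to itself
```

In practice you would pass the arrays returned by `get_embedding_vectors` instead of the dummy vectors.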
Predict next 'n' words
from inltk.inltk import predict_next_words

predict_next_words(text, n, '<code-of-language>')
# text --> string in <code-of-language>
# n --> number of words you want to predict (integer)
Note: You can also pass a fourth parameter, randomness, to predict_next_words. It has a default value of 0.8
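A randomness parameter like this typically acts as a sampling temperature (this interpretation is an assumption; check the iNLTK source for the exact behavior). A minimal, self-contained sketch of temperature-based next-word sampling over a toy score table:

```python
import math
import random

def sample_next_word(scores, randomness, seed=0):
    """Sample a word from unnormalized scores. Lower randomness sharpens
    the distribution toward the highest-scoring word; higher randomness
    flattens it toward uniform."""
    words = list(scores)
    scaled = [scores[w] / randomness for w in words]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax numerator
    return random.Random(seed).choices(words, weights=weights, k=1)[0]

toy_scores = {"भारत": 3.0, "देश": 1.0, "है": 0.5}  # hypothetical model scores

# With very low randomness, the top-scoring word is chosen almost surely:
print(sample_next_word(toy_scores, randomness=0.01))  # → भारत
```

This illustrates why the default of 0.8 gives mildly varied but still plausible continuations.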
Identify language
Note: If you have updated your iNLTK version, you need to run reset_language_identifying_models before identifying language.
from inltk.inltk import identify_language, reset_language_identifying_models

reset_language_identifying_models()  # only if you've updated iNLTK version
identify_language(text)  # text --> string in one of the supported languages

Example:

>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'
Remove foreign languages
from inltk.inltk import remove_foreign_languages

remove_foreign_languages(text, '<code-of-language>')
# text --> string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain

Example:

>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '▁', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '▁', '<unk>', ':', '<unk>']
Every word from a language other than the host language will be replaced with <unk>, and ▁ represents the space character.
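To turn such a token list back into a cleaned string, you can drop the <unk> placeholders and treat ▁ as the start of a new word. This detokenization helper is a sketch of my own, not an iNLTK API:

```python
def detokenize(tokens):
    """Join SentencePiece-style tokens into a string, dropping <unk>
    placeholders. A leading '▁' marks the start of a new word; tokens
    without it (e.g. punctuation) attach to the previous word."""
    text = ""
    for tok in tokens:
        if tok == "<unk>":
            continue  # skip foreign-language placeholders
        if tok.startswith("▁"):
            text += " " + tok[1:]
        else:
            text += tok
    return " ".join(text.split())  # collapse leftover whitespace

tokens = ['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁पर',
          '▁प्रामाणिक', '▁और', '▁उपयोग', ',', '▁परिवर्तन']
print(detokenize(tokens))
# → विकिपीडिया सभी विषयों पर प्रामाणिक और उपयोग, परिवर्तन
```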
Check out this notebook by Amol Mahajan, where he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.
Repositories containing the models used in iNLTK
Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score |
---|---|---|---|---|---|
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
Malayalam | NLP for Malayalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
Tamil | NLP for Tamil | ~20 | >127,000 articles | ~97 (News Classification) | ~95 (News Classification) |
Urdu | NLP for Urdu | ~13 | >150,000 articles | ~94 (News Classification) | ~90 (News Classification) |
Contributing
Adding support for a new language to iNLTK
If you would like to add support for a language of your choice to iNLTK, please start by checking/raising an issue here.
Please first check the steps mentioned here for Telugu; it should be similar for other languages.
Improving models / using models for your own research
If you would like to take iNLTK's models and fine-tune them on your own dataset, or build your own custom models on top of them, please check out the repository for your language of choice in the table above. Those repositories contain links to the datasets, pretrained models, classifiers and all related code.
Adding a new feature
If you want a particular feature in iNLTK, please start by checking/raising an issue here.
What's next (in progress)
Shout out if you want to help :)
What's next (not yet started)
Shout out if you want to lead :)
- Build a unified model for all the languages
- Add translation between iNLTK languages + English