Natural Language Toolkit for Indic Languages (iNLTK)
Detailed description of the inltk Python project
iNLTK aims to provide out-of-the-box support for the various NLP tasks that an application developer may need for Indic languages.
Installation
pip install http://download.pytorch.org/whl/cpu/torch-1.0.0-cp36-cp36m-linux_x86_64.whl
pip install inltk
iNLTK runs on CPU, as do most deep learning models in production. The first command above installs the CPU-only build of PyTorch which, as the name suggests, does not have CUDA support.
Note: inltk is currently supported only on Linux with Python >= 3.6
Installing on Windows (experimental)
pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp36-cp36m-win_amd64.whl
pip install inltk
Supported languages
Language | Code |
---|---|
Hindi | hi |
Punjabi | pa |
Sanskrit | sa |
Gujarati | gu |
Kannada | kn |
Malayalam | ml |
Nepali | ne |
Odia | or |
Marathi | mr |
Bengali | bn |
Tamil | ta |
Urdu | ur |
Usage
Setup the language
from inltk.inltk import setup

setup('<code-of-language>')  # if you want to use Hindi, then setup('hi')
Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.
Tokenize
from inltk.inltk import tokenize

tokenize(text, '<code-of-language>')  # where text is a string in <code-of-language>
Get embedding vectors
This returns an array of embedding vectors, one for each token in the text.
from inltk.inltk import get_embedding_vectors
vectors = get_embedding_vectors(text, '<code-of-language>')  # where text is a string in <code-of-language>
Example:
>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)
>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ..., 0.859898, 1.940608, 0.09252 , 1.043363], dtype=float32), array([ 0.290839, 1.459981, -0.582347, 0.27822 , ..., -0.736542, -0.259388, 0.086048, 0.736173], dtype=float32), array([ 0.069481, -0.069362, 0.17558 , -0.349333, ..., 0.390819, 0.117293, -0.194081, 2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131, 0.161678, ..., 0.048844, -1.090546, 0.154555, 0.925028], dtype=float32), array([ 0.219287, 0.759776, 0.695487, 1.097593, ..., 0.016115, -0.81602 , 0.333799, 1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479, 0.177357, ..., 0.729619, -0.161499, -0.270225, 2.083801], dtype=float32), array([-0.501414, 1.337661, -0.405563, 0.733806, ..., -0.182045, -1.413752, 0.163339, 0.907111], dtype=float32), array([ 0.185258, -0.429729, 0.060273, 0.232177, ..., -0.537831, -0.51664 , -0.249798, 1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8
To get a feel for the embeddings, check out this visualization of a subset of the Hindi embedding vectors.
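Once you have the per-token vectors, a common next step is to compare them. This is a minimal sketch (not an iNLTK API) of cosine similarity over 400-dimensional vectors; dummy random vectors stand in for real `get_embedding_vectors` output.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
v1 = [rng.gauss(0, 1) for _ in range(400)]  # stands in for vectors[0] of one word
v2 = [rng.gauss(0, 1) for _ in range(400)]  # stands in for vectors[0] of another word

print(cosine_similarity(v1, v2))  # some value in [-1, 1]
print(cosine_similarity(v1, v1))  # ~1.0: a vector is maximally similar to itself
```

In practice you would pass the arrays returned by `get_embedding_vectors` instead of the dummy vectors.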
Predict next 'n' words
from inltk.inltk import predict_next_words

predict_next_words(text, n, '<code-of-language>')
# text --> string in <code-of-language>
# n --> number of words you want to predict (integer)
Note: You can also pass a fourth parameter, randomness, to predict_next_words. It has a default value of 0.8
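A randomness parameter like this typically acts as a sampling temperature (this interpretation is an assumption; check the iNLTK source for the exact behavior). A minimal, self-contained sketch of temperature-based next-word sampling over a toy score table:

```python
import math
import random

def sample_next_word(scores, randomness, seed=0):
    """Sample a word from unnormalized scores. Lower randomness sharpens
    the distribution toward the highest-scoring word; higher randomness
    flattens it toward uniform."""
    words = list(scores)
    scaled = [scores[w] / randomness for w in words]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax numerator
    return random.Random(seed).choices(words, weights=weights, k=1)[0]

toy_scores = {"भारत": 3.0, "देश": 1.0, "है": 0.5}  # hypothetical model scores

# With very low randomness, the top-scoring word is chosen almost surely:
print(sample_next_word(toy_scores, randomness=0.01))  # → भारत
```

This illustrates why the default of 0.8 gives mildly varied but still plausible continuations.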
Identify language
Note: If you have updated your iNLTK version, you need to run reset_language_identifying_models before identifying language.
from inltk.inltk import identify_language, reset_language_identifying_models

reset_language_identifying_models()  # only if you've updated iNLTK version
identify_language(text)  # text --> string in one of the supported languages

Example:

>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'
Remove foreign languages
from inltk.inltk import remove_foreign_languages

remove_foreign_languages(text, '<code-of-language>')
# text --> string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain

Example:

>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '▁', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '▁', '<unk>', ':', '<unk>']
Every word from a language other than the host language will be replaced with <unk>, and ▁ represents the space character.
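To turn such a token list back into a cleaned string, you can drop the <unk> placeholders and treat ▁ as the start of a new word. This detokenization helper is a sketch of my own, not an iNLTK API:

```python
def detokenize(tokens):
    """Join SentencePiece-style tokens into a string, dropping <unk>
    placeholders. A leading '▁' marks the start of a new word; tokens
    without it (e.g. punctuation) attach to the previous word."""
    text = ""
    for tok in tokens:
        if tok == "<unk>":
            continue  # skip foreign-language placeholders
        if tok.startswith("▁"):
            text += " " + tok[1:]
        else:
            text += tok
    return " ".join(text.split())  # collapse leftover whitespace

tokens = ['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁पर',
          '▁प्रामाणिक', '▁और', '▁उपयोग', ',', '▁परिवर्तन']
print(detokenize(tokens))
# → विकिपीडिया सभी विषयों पर प्रामाणिक और उपयोग, परिवर्तन
```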
Check out this notebook by Amol Mahajan, where he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.
Repositories containing the models used in iNLTK
Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score |
---|---|---|---|---|---|
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
Malayalam | NLP for Malayalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
Tamil | NLP for Tamil | ~20 | >127,000 articles | ~97 (News Classification) | ~95 (News Classification) |
Urdu | NLP for Urdu | ~13 | >150,000 articles | ~94 (News Classification) | ~90 (News Classification) |
Contributing
Adding support for a new language to iNLTK
If you would like to add support for a language of your choice to iNLTK, please start by checking/raising an issue here.
Please first check the steps mentioned here for Telugu; it should be similar for other languages.
Improving models / using models for your own research
If you would like to take iNLTK's models and fine-tune them on your own dataset, or build your own custom models on top of them, please check out the repository for your language of choice in the table above. Those repositories contain links to the datasets, pretrained models, classifiers and all related code.
Adding a new feature
If you want a particular feature in iNLTK, please start by checking/raising an issue here.
What's next (in progress)
Shout out if you want to help :)
What's next (not yet started)
Shout out if you want to lead :)
- Build a unified model for all the languages
- Add translation between iNLTK languages + English