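Ekushey
"Ekushey" is the first structured and cost-efficient Bangla Natural Language Processing toolkit.
Current Modules
feature_extraction is a feature extractor for Bangla Natural Language Processing.
Feature Extraction
Installation
Examples
1. CountVectorizer
- Fit n Transform
- Transform
- Get Wordset
Fit n Transform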
    from ekushey.feature_extraction import CountVectorizer

    ct = CountVectorizer()
    X = ct.fit_transform(X)  # X is the word features
Output:
The count-vectorized matrix form of the given features.
Transform
    from ekushey.feature_extraction import CountVectorizer

    ct = CountVectorizer()
    get_mat = ct.transform("রাহাত")
Output:
The count-vectorized matrix form of the given word.
Get Wordset
    from ekushey.feature_extraction import CountVectorizer

    ct = CountVectorizer()
    ct.get_wordSet()
Output:
The raw wordset used in training the model.
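To make the idea behind count vectorization concrete, here is a minimal pure-Python sketch. It is not Ekushey's implementation: whitespace tokenization, the sorted vocabulary order, and the helper names `build_vocabulary` and `count_vectorize` are assumptions chosen for illustration.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Map each distinct whitespace-separated token to a column index."""
    tokens = sorted({tok for doc in corpus for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vectorize(corpus, vocabulary):
    """Return one row of token counts per document; unknown tokens are skipped."""
    rows = []
    for doc in corpus:
        counts = Counter(doc.split())
        # Dict iteration follows insertion order, which matches the column index.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


corpus = ["আমাদের দেশ বাংলাদেশ", "আমার বাংলা আমার দেশ"]
vocabulary = build_vocabulary(corpus)          # plays the role of the "wordset"
matrix = count_vectorize(corpus, vocabulary)   # fit + transform on the corpus
unseen = count_vectorize(["দেশ দেশ সুন্দর"], vocabulary)  # transform only
```

The vocabulary is learned once and then reused, which is why transforming new text can only count words that were present during fitting.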
2. HashVectorizer
- Fit n Transform
- Transform
    from ekushey.feature_extraction import HashVectorizer

    corpus = ['আমাদের দেশ বাংলাদেশ', 'আমার বাংলা']
    vectorizer = HashVectorizer()
    n_features = 8
    X = vectorizer.fit_transform(corpus, n_features)

    corpus_t = ["আমাদের দেশ অনেক সুন্দর"]
    Xf = vectorizer.transform(corpus_t)

    print(X.shape, Xf.shape)
    print("=====================================")
    print(X)
    print("=====================================")
    print(Xf)
Output:
(2, 8) (1, 8)
=====================================
(0, 7) -1.0
(1, 7) -1.0
=====================================
(0, 0) 0.5773502691896258
(0, 2) 0.5773502691896258
(0, 7) -0.5773502691896258
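The following sketch illustrates the general hashing trick: each token is hashed into one of `n_features` buckets, so no vocabulary has to be stored. The negative and unit-length values in the output above are consistent with signed hashing followed by L2 normalization, but that is an inference, not something the source states. The MD5-based bucket choice and the function name `hash_vectorize` are assumptions for illustration only.

```python
import hashlib
import math


def hash_vectorize(doc, n_features=8):
    """Project a document into a fixed-size, L2-normalized vector."""
    vec = [0.0] * n_features
    for token in doc.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % n_features
        # A sign derived from the hash lets colliding tokens partly cancel out.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


train_vec = hash_vectorize("আমাদের দেশ বাংলাদেশ")
test_vec = hash_vectorize("আমাদের দেশ অনেক সুন্দর")
```

Because the bucket depends only on the token, unseen text can be transformed without any fitting step; the trade-off is that different tokens may collide in the same bucket.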
3. TfIdf
- Fit n Transform
- Transform
- Coefficients
Fit n Transform
    from ekushey.feature_extraction import TfIdfVectorizer

    k = TfIdfVectorizer()
    doc = ["কাওছার আহমেদ", "শুভ হাইদার"]
    matrix1 = k.fit_transform(doc)
    print(matrix1)
Output:
[[0.150515 0.150515 0. 0. ]
[0. 0. 0.150515 0.150515]]
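Transform
    from ekushey.feature_extraction import TfIdfVectorizer

    k = TfIdfVectorizer()
    # Fit first so the vocabulary and IDF weights exist (assumed to be required);
    # the output below matches the vocabulary of these training documents.
    k.fit_transform(["কাওছার আহমেদ", "শুভ হাইদার"])
    doc = ["আহমেদ সুমন", "কাওছার করিম"]
    matrix2 = k.transform(doc)
    print(matrix2)
Output: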
[[0.150515 0. 0. 0. ]
[0. 0.150515 0. 0. ]]
Coefficients
    from ekushey.feature_extraction import TfIdfVectorizer

    k = TfIdfVectorizer()
    doc = ["কাওছার আহমেদ", "শুভ হাইদার"]
    k.fit_transform(doc)
    wordset, idf = k.coefficients()
    print(wordset)
    # Output: ['আহমেদ', 'কাওছার', 'হাইদার', 'শুভ']
    print(idf)
    '''
    Output:
    {'আহমেদ': 0.3010299956639812,
     'কাওছার': 0.3010299956639812,
     'হাইদার': 0.3010299956639812,
     'শুভ': 0.3010299956639812}
    '''
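The numbers above can be reproduced by hand with one common TF-IDF variant: term frequency as count divided by document length, and inverse document frequency as a base-10 logarithm of (number of documents / documents containing the term). The sketch below is an assumption about the formula, chosen because it matches the printed values; it is not taken from Ekushey's source, and the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(corpus):
    """Return (vocabulary, idf weights, tf-idf matrix) for whitespace-tokenized docs."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents does each token occur?
    df = {tok: sum(tok in doc for doc in docs) for tok in vocab}
    idf = {tok: math.log10(n_docs / df[tok]) for tok in vocab}
    matrix = [
        [(doc.count(tok) / len(doc)) * idf[tok] if doc else 0.0 for tok in vocab]
        for doc in docs
    ]
    return vocab, idf, matrix


vocab, idf, matrix = tf_idf(["কাওছার আহমেদ", "শুভ হাইদার"])
# Each word fills half of its two-word document and occurs in 1 of 2 documents,
# so its weight is 0.5 * log10(2), about 0.150515.
```

The column order here is sorted and may differ from the wordset order the library prints.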
4. Word Embedding
- Word2Vec
  - Training
  - Get Word Vector
  - Get Similarity
  - Get n Similar Words
  - Get Middle Word
  - Get Odd Words
  - Get Similarity Plot
Training
    from ekushey.feature_extraction import BN_Word2Vec

    # Training against sentences
    w2v = BN_Word2Vec(sentences=[
        ['আমার', 'প্রিয়', 'জন্মভূমি'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['আমার', 'প্রিয়', 'জন্মভূমি'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['আমার', 'প্রিয়', 'জন্মভূমি'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ])
    w2v.train()

    # Training against one text corpus
    w2v = BN_Word2Vec(corpus_file="path_to_corpus.txt")
    w2v.train()

    # Training against multiple corpora
    '''
    path
      ->corpus
        ->1.txt
        ->2.txt
        ->3.txt
    '''
    w2v = BN_Word2Vec(corpus_path="path/corpus")
    w2v.train(epochs=25)

    # Training against a DataFrame column
    # (news_data is assumed to be an existing pandas DataFrame)
    w2v = BN_Word2Vec(df=news_data['text_content'])
    w2v.train(epochs=25)
After training is done, the model "w2v_model" along with its supporting vector files will be saved to the current directory.
If you use a pretrained model, specify it when initializing BN_Word2Vec(). Otherwise no model_name is needed.
Get Word Vector
    from ekushey.feature_extraction import BN_Word2Vec

    w2v = BN_Word2Vec(model_name='give the model name here')
    w2v.get_wordVector('আমার')
Get Similarity
    from ekushey.feature_extraction import BN_Word2Vec

    w2v = BN_Word2Vec(model_name='give the model name here')
    w2v.get_similarity('ঢাকা', 'রাজধানী')
Output:
67.457879
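Similarity between two word vectors is usually measured as cosine similarity. The sketch below shows that computation on made-up 3-dimensional vectors; the vectors, the variable names, and the final scaling to a 0-100 range are assumptions (the score above merely looks like a percentage), not documented behaviour of `get_similarity`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
dhaka = [0.9, 0.1, 0.3]
capital = [0.8, 0.2, 0.4]
sky = [-0.2, 0.9, 0.1]

score = cosine_similarity(dhaka, capital)
percent = 100 * score  # assumed scaling to a percentage-like score
```

Related words point in similar directions and score close to 1, while unrelated words score near 0 or below.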
Get n Similar Words
    from ekushey.feature_extraction import BN_Word2Vec

    w2v = BN_Word2Vec(model_name='give the model name here')
    w2v.get_n_similarWord(['পদ্মা'], n=10)
Get Middle Word
Get the probability distribution of the center word given a list of context words.
    from ekushey.feature_extraction import BN_Word2Vec

    w2v = BN_Word2Vec(model_name='give the model name here')
    w2v.get_outputWord(['ঢাকায়', 'মৃত্যু'], n=2)
Output:
[("হয়েছে।',", 0.05880642), ('শ্রমিকের', 0.05639163)]
Get Odd Words
Find the word that least matches the others in the given word list.
    from ekushey.feature_extraction import BN_Word2Vec

    w2v = BN_Word2Vec(model_name='give the model name here')
    w2v.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])
Output:
'আকাশ'
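One common way to find the odd word out is to average the unit-length vectors of all words and pick the word least similar to that average. The sketch below applies this to hypothetical toy vectors; whether `get_oddWords` works this way internally is an assumption, and the helper names are illustrative.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical vectors: three food words cluster together, "sky" does not.
toy_vectors = {
    "চাল": [0.9, 0.1, 0.0],
    "ডাল": [0.8, 0.2, 0.1],
    "চিনি": [0.7, 0.3, 0.0],
    "আকাশ": [0.0, 0.1, 0.9],
}
result = odd_one_out(toy_vectors)
```

With these toy vectors the three food words dominate the mean, so the outlier is the word pointing in a different direction.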
Get Similarity Plot
Creates a bar plot of similar words with their probability.
    from ekushey.feature_extraction import BN_Word2Vec

    w2v = BN_Word2Vec(model_name='give the model name here')
    w2v.get_similarity_plot('চাউল', 5)
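Both the n-similar-words lookup and the similarity plot rest on the same ranking step: score every other word against the query and keep the top n. A minimal sketch with hypothetical toy vectors follows; the function names and data are illustrative assumptions, and a bar chart would simply plot the returned scores.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, n=2):
    """Return the n (word, score) pairs closest to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy embeddings.
toy_vectors = {
    "পদ্মা": [0.9, 0.2, 0.1],
    "পদ্মার": [0.85, 0.25, 0.1],
    "নদী": [0.6, 0.6, 0.2],
    "আকাশ": [0.0, 0.1, 0.9],
}
top = most_similar("পদ্মা", toy_vectors, n=2)
```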
- FastText
  - Training
  - Get Word Vector
  - Get Similarity
  - Get n Similar Words
  - Get Middle Word
  - Get Odd Words
Training
    from ekushey.feature_extraction import BN_FastText

    # Training against sentences
    ft = BN_FastText(sentences=[
        ['আমার', 'প্রিয়', 'জন্মভূমি'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ])
    ft.train()

    # Training against one text corpus
    ft = BN_FastText(corpus_file="path to data or txt file")
    ft.train()

    # Training against multiple corpora
    '''
    path
      ->Corpus
        ->1.txt
        ->2.txt
        ->3.txt
    '''
    ft = BN_FastText(corpus_path="path/Corpus")
    ft.train(epochs=25)

    # Training against a DataFrame column
    # (news_data is assumed to be an existing pandas DataFrame)
    ft = BN_FastText(df=news_data['text_content'])
    ft.train(epochs=25)
After training is done, the model "ft_model" along with its supporting vector files will be saved to the current directory.
If you do not want to train and would rather use a pretrained model, specify it when initializing BN_FastText(). Otherwise no model_name is needed.
Get Word Vector
    from ekushey.feature_extraction import BN_FastText

    ft = BN_FastText(model_name='give the model name here')
    ft.get_wordVector('আমার')
Get Similarity
    from ekushey.feature_extraction import BN_FastText

    ft = BN_FastText(model_name='give the model name here')
    ft.get_similarity('ঢাকা', 'রাজধানী')
Output:
70.56821120
Get n Similar Words
Output:
[('পদ্মায়', 0.8103810548782349),
('পদ্মার', 0.794012725353241),
('পদ্মানদীর', 0.7747839689254761),
('পদ্মা-মেঘনার', 0.7573559284210205),
('পদ্মা.', 0.7470568418502808),
('‘পদ্মা', 0.7413997650146484),
('পদ্মাসেতুর', 0.716225266456604),
('পদ্ম', 0.7154797315597534),
('পদ্মহেম', 0.6881639361381531),
('পদ্মাবত', 0.6682782173156738)]
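The neighbours listed above are mostly inflected or compound forms of the query word, which fits how FastText represents a word: as a bag of character n-grams taken from the word wrapped in `<` and `>` boundary markers. The sketch below only extracts those n-grams; the 3 to 6 length range is FastText's usual default, while the function name `char_ngrams` is illustrative and nothing here is stated to be configurable through BN_FastText.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


base = set(char_ngrams("পদ্মা"))
inflected = set(char_ngrams("পদ্মার"))
shared = base & inflected  # subword units the two forms have in common
```

Python slices by Unicode code point, so a Bangla conjunct spans several positions. Words that share many n-grams end up with similar vectors, even if one of them was rare or unseen during training.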
Get Odd Words
Find the word that least matches the others in the given word list.
    from ekushey.feature_extraction import BN_FastText

    ft = BN_FastText(model_name='give the model name here')
    ft.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])
Output:
'আকাশ'
Get Similarity Plot
Creates a bar plot of similar words with their probability.
- GloVe
  - Training
  - Get n Similar Words
Training
    from ekushey.feature_extraction import BN_GloVe

    # Training against sentences
    glv = BN_GloVe(sentences=[
        ['আমার', 'প্রিয়', 'জন্মভূমি'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
        ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ])
    glv.train()

    # Training against one text corpus
    glv = BN_GloVe(corpus_file="path_to_corpus.txt")
    glv.train()

    # Training against multiple corpora
    '''
    path
      ->Corpus
        ->1.txt
        ->2.txt
        ->3.txt
    '''
    glv = BN_GloVe(corpus_path="path/corpus")
    glv.train(epochs=25)

    # Training against a DataFrame column
    # (news_data is assumed to be an existing pandas DataFrame)
    glv = BN_GloVe(df=news_data['text_content'])
    glv.train(epochs=25)
After training is done, the GloVe model along with its supporting vector files will be saved to the current directory.
If you do not want to train and would rather use a pretrained model, specify it when initializing BN_GloVe(). Otherwise no model_name is needed.
Get n Similar Words
    from ekushey.feature_extraction import BN_GloVe

    glv = BN_GloVe(model_name='give the model name here')
    glv.get_n_similarWord(['পদ্মা'], n=10)
Output:
[('পদ্মায়', 0.8103810548782349),
('পদ্মার', 0.794012725353241),
('পদ্মানদীর', 0.7747839689254761),
('পদ্মা-মেঘনার', 0.7573559284210205),
('পদ্মা.', 0.7470568418502808),
('‘পদ্মা', 0.7413997650146484),
('পদ্মাসেতুর', 0.716225266456604),
('পদ্ম', 0.7154797315597534),
('পদ্মহেম', 0.6881639361381531),
('পদ্মাবত', 0.6682782173156738)]
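Unlike Word2Vec, GloVe is fitted to global word co-occurrence statistics, so its first step is counting how often words appear near each other. The sketch below builds such counts with a symmetric window and 1/distance weighting, as described for GloVe in general; the window size and the function name `cooccurrence_counts` are assumptions, and this is not BN_GloVe's code.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate distance-weighted, symmetric co-occurrence counts."""
    counts = defaultdict(float)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Look back at up to `window` preceding words.
            for j in range(max(0, i - window), i):
                context = sentence[j]
                weight = 1.0 / (i - j)  # closer words contribute more
                counts[(word, context)] += weight
                counts[(context, word)] += weight
    return dict(counts)


sentences = [["আমার", "প্রিয়", "জন্মভূমি"], ["বাংলা", "আমার", "মাতৃভাষা"]]
counts = cooccurrence_counts(sentences)
```

Training then looks for vectors whose dot products approximate the logarithm of these counts, which is why a larger corpus gives more reliable neighbours.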