Ekushey

"Ekushey" is the first structured and cost-efficient Bangla Natural Language Processing toolkit.

Current Modules

feature_extraction is a feature extractor for Bangla Natural Language Processing.

Feature Extraction

  1. CountVectorizer
  2. HashVectorizer
  3. TfIdf
  4. Word Embedding

Installation

pip install ekushey

Examples

1. CountVectorizer

  • Fit n Transform
  • Transform
  • Get Wordset

Fit n Transform

from ekushey.feature_extraction import CountVectorizer

ct = CountVectorizer()
X = ct.fit_transform(X)  # X is the word features

Output:

The count-vectorized matrix form of the given features.

Transform

from ekushey.feature_extraction import CountVectorizer

ct = CountVectorizer()
get_mat = ct.transform("রাহাত")

Output:

The count-vectorized matrix form of the given word.

Get Wordset

from ekushey.feature_extraction import CountVectorizer

ct = CountVectorizer()
ct.get_wordSet()

Output:

The raw wordset used in training the model.
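
To make the relationship between the wordset and the matrix concrete, here is a minimal pure-Python sketch of count vectorization. It is not Ekushey's implementation: the helper name count_vectorize, the whitespace tokenization, and the sorted column order are all assumptions made for illustration.

```python
def count_vectorize(docs):
    """Return (wordset, matrix) where matrix[i][j] counts wordset[j] in docs[i]."""
    # Assumption: tokens are separated by whitespace.
    tokenized = [doc.split() for doc in docs]
    # A sorted vocabulary gives every word a stable column index.
    wordset = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(wordset)}
    matrix = [[0] * len(wordset) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[i][index[word]] += 1
    return wordset, matrix


docs = ["আমাদের দেশ বাংলাদেশ", "আমার বাংলা আমার দেশ"]
wordset, matrix = count_vectorize(docs)
print(wordset)
print(matrix)
```

Each row of the matrix corresponds to one document and each column to one entry of the wordset, so a row always sums to the number of tokens in its document.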

2. HashVectorizer

  • Fit n Transform
  • Transform

from ekushey.feature_extraction import HashVectorizer

corpus = ['আমাদের দেশ বাংলাদেশ', 'আমার বাংলা']
vectorizer = HashVectorizer()
n_features = 8
X = vectorizer.fit_transform(corpus, n_features)

corpus_t = ["আমাদের দেশ অনেক সুন্দর"]
Xf = vectorizer.transform(corpus_t)

print(X.shape, Xf.shape)
print("=====================================")
print(X)
print("=====================================")
print(Xf)

Output:

(2, 8) (1, 8)
=====================================
  (0, 7)	-1.0
  (1, 7)	-1.0
=====================================
  (0, 0)	0.5773502691896258
  (0, 2)	0.5773502691896258
  (0, 7)	-0.5773502691896258
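
The rows printed above have unit Euclidean length (0.5773502691896258 is 1/√3) and carry signs, which is what a signed hashing trick followed by L2 normalization would produce. The sketch below illustrates that idea only. The md5-based hash, the helper name hash_vectorize, and the whitespace tokenization are assumptions; Ekushey's actual hash function is not stated in this document, so the bucket indices will not match the output above.

```python
import hashlib
import math


def hash_vectorize(doc, n_features=8):
    """Map a document to a fixed-length, L2-normalized vector via hashing."""
    vec = [0.0] * n_features
    for token in doc.split():  # assumption: whitespace tokenization
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit integer from the digest
        bucket = h % n_features                # column chosen by the hash
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0  # top bit picks the sign
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in vec] if norm else vec


row = hash_vectorize("আমাদের দেশ অনেক সুন্দর")
print(len(row), row)
```

No vocabulary is stored, so the vector length is fixed by n_features regardless of corpus size; the price is that distinct words can collide in the same column.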

3. TfIdf

  • Fit n Transform
  • Transform
  • Coefficients

Fit n Transform

from ekushey.feature_extraction import TfIdfVectorizer

k = TfIdfVectorizer()
doc = ["কাওছার আহমেদ", "শুভ হাইদার"]
matrix1 = k.fit_transform(doc)
print(matrix1)

Output:

[[0.150515 0.150515 0.       0.      ]
 [0.       0.       0.150515 0.150515]]

Transform

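from ekushey.feature_extraction import TfIdfVectorizer

# Note: the output below has the four columns learned in the
# Fit n Transform example, so transform() presumably relies on a
# vectorizer that has already been fitted.
k = TfIdfVectorizer()
doc = ["আহমেদ সুমন", "কাওছার করিম"]
matrix2 = k.transform(doc)
print(matrix2)

Output: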

[[0.150515 0.       0.       0.      ]
 [0.       0.150515 0.       0.      ]]

Coefficients

from ekushey.feature_extraction import TfIdfVectorizer

k = TfIdfVectorizer()
doc = ["কাওছার আহমেদ", "শুভ হাইদার"]
k.fit_transform(doc)
wordset, idf = k.coefficients()
print(wordset)
# Output: ['আহমেদ', 'কাওছার', 'হাইদার', 'শুভ']
print(idf)
'''
Output: {'আহমেদ': 0.3010299956639812, 'কাওছার': 0.3010299956639812, 'হাইদার': 0.3010299956639812, 'শুভ': 0.3010299956639812}
'''
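
The numbers in the TfIdf examples are consistent with a term frequency of count / document length and an inverse document frequency of log10(N / df). This is inferred from the printed values, not confirmed from Ekushey's source, so treat the sketch below as a worked check of the arithmetic rather than a description of the library internals. The helper names tf and idf are illustrative.

```python
import math

# The two documents from the example, already split into words.
docs = [["কাওছার", "আহমেদ"], ["শুভ", "হাইদার"]]
n_docs = len(docs)


def idf(word):
    # df: number of documents that contain the word.
    df = sum(1 for doc in docs if word in doc)
    return math.log10(n_docs / df)


def tf(word, doc):
    # Share of the document's tokens taken up by the word.
    return doc.count(word) / len(doc)


word = docs[0][1]
print(idf(word))                      # log10(2 / 1)
print(tf(word, docs[0]) * idf(word))  # 0.5 * log10(2)
```

Every word occurs in exactly one of the two documents, which is why all four IDF values are equal, and each document has two words, which halves the IDF to give the matrix entries.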

4. Word Embedding

  • Word2Vec

    • Training
    • Get Word Vector
    • Get Similarity
    • Get n Similar Words
    • Get Middle Word
    • Get Odd Words
    • Get Similarity Plot

Training

from ekushey.feature_extraction import BN_Word2Vec

# Training against sentences
w2v = BN_Word2Vec(sentences=[
    ['আমার', 'প্রিয়', 'জন্মভূমি'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['আমার', 'প্রিয়', 'জন্মভূমি'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['আমার', 'প্রিয়', 'জন্মভূমি'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
])
w2v.train()

# Training against one text corpus
w2v = BN_Word2Vec(corpus_file="path_to_corpus.txt")
w2v.train()

# Training against multiple corpora
'''
    path
      ->corpus
        ->1.txt
        ->2.txt
        ->3.txt
'''
w2v = BN_Word2Vec(corpus_path="path/corpus")
w2v.train(epochs=25)

# Training against a DataFrame column
# (news_data is assumed to be an existing pandas DataFrame)
w2v = BN_Word2Vec(df=news_data['text_content'])
w2v.train(epochs=25)

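After training is done, the model "w2v_model", along with its supporting vector files, will be saved to the current directory.

If you use a pretrained model, specify it when initializing BN_Word2Vec(). Otherwise no model_name is needed.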
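
The sentences argument in the training example is a list of token lists. For the multi-file layout shown there, something along these lines could turn a directory of .txt files into that shape. This is a sketch under stated assumptions: the helper load_sentences is hypothetical, it treats each non-empty line as one sentence, and it tokenizes on whitespace. Ekushey's own corpus_path loader may work differently.

```python
from pathlib import Path


def load_sentences(corpus_path):
    """Read every .txt file under corpus_path into a list of token lists."""
    sentences = []
    for txt_file in sorted(Path(corpus_path).glob("*.txt")):
        with txt_file.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # assumption: whitespace-separated tokens
                if tokens:             # skip blank lines
                    sentences.append(tokens)
    return sentences


# Example (the directory name is a placeholder):
# sentences = load_sentences("path/corpus")
```

Reading files in sorted order keeps the result reproducible across runs, which helps when comparing models trained on the same corpus.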

Get Word Vector

from ekushey.feature_extraction import BN_Word2Vec

w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_wordVector('আমার')

Get Similarity

from ekushey.feature_extraction import BN_Word2Vec

w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_similarity('ঢাকা', 'রাজধানী')

Output:

67.457879
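
Similarity between word embeddings is usually measured as the cosine of the angle between the two word vectors, and the value above looks like such a score expressed as a percentage, although this document does not say so explicitly. The sketch below shows the computation on toy vectors that are invented for illustration; the function name cosine_similarity is not part of Ekushey.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 1.0, 0.0]
print(100 * cosine_similarity(vec_a, vec_b))  # percentage-style score
```

Because the score depends only on direction, two words can be highly similar even when their vectors differ greatly in length.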

Get n Similar Words

from ekushey.feature_extraction import BN_Word2Vec

w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_n_similarWord(['পদ্মা'], n=10)


Get Middle Word

Gets the probability distribution of the center word given a list of context words.

from ekushey.feature_extraction import BN_Word2Vec

w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_outputWord(['ঢাকায়', 'মৃত্যু'], n=2)

Output:

[("হয়েছে।',", 0.05880642), ('শ্রমিকের', 0.05639163)]

Get Odd Words

Finds the word that least matches the others in the given word list.

from ekushey.feature_extraction import BN_Word2Vec

w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])

Output:

'আকাশ' 
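
One common way to pick the odd word out (the approach taken by gensim's doesnt_match, for instance) is to average the unit-normalized word vectors and return the word least similar to that mean. Whether Ekushey does exactly this is an assumption. The sketch below uses invented 2-dimensional vectors, and the helper names unit and odd_word are illustrative.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_word(vectors):
    """Return the key whose vector is least similar to the mean direction."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # The dot product of two unit vectors is their cosine similarity.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Invented vectors: three food words point one way, the last word another.
toy = {
    "চাল": [0.9, 0.1],
    "ডাল": [0.8, 0.2],
    "চিনি": [0.85, 0.15],
    "আকাশ": [0.1, 0.9],
}
print(odd_word(toy))
```

With real embeddings the vectors have hundreds of dimensions, but the selection rule stays the same.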

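Get Similarity Plot

Creates a bar plot of similar words with their probabilities.

from ekushey.feature_extraction import BN_Word2Vec

w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_similarity_plot('চাউল', 5)

  • FastText

    • Training
    • Get Word Vector
    • Get Similarity
    • Get n Similar Words
    • Get Middle Word
    • Get Odd Words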

Training

from ekushey.feature_extraction import BN_FastText

# Training against sentences
ft = BN_FastText(sentences=[
    ['আমার', 'প্রিয়', 'জন্মভূমি'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
])
ft.train()

# Training against one text corpus
ft = BN_FastText(corpus_file="path to data or txt file")
ft.train()

# Training against multiple corpora
'''
    path
      ->Corpus
        ->1.txt
        ->2.txt
        ->3.txt
'''
ft = BN_FastText(corpus_path="path/Corpus")
ft.train(epochs=25)

# Training against a DataFrame column
# (news_data is assumed to be an existing pandas DataFrame)
ft = BN_FastText(df=news_data['text_content'])
ft.train(epochs=25)

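After training is done, the model "ft_model", along with its supporting vector files, will be saved to the current directory.

If you don't want to train and would rather use a pretrained model, specify it when initializing BN_FastText(). Otherwise no model_name is needed.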

Get Word Vector

from ekushey.feature_extraction import BN_FastText

ft = BN_FastText(model_name='give the model name here')
ft.get_wordVector('আমার')

Get Similarity

from ekushey.feature_extraction import BN_FastText

ft = BN_FastText(model_name='give the model name here')
ft.get_similarity('ঢাকা', 'রাজধানী')

Output:

70.56821120

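Get n Similar Words

from ekushey.feature_extraction import BN_FastText

ft = BN_FastText(model_name='give the model name here')
ft.get_n_similarWord(['পদ্মা'], n=10)

Output: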

[('পদ্মায়', 0.8103810548782349),
 ('পদ্মার', 0.794012725353241),
 ('পদ্মানদীর', 0.7747839689254761),
 ('পদ্মা-মেঘনার', 0.7573559284210205),
 ('পদ্মা.', 0.7470568418502808),
 ('‘পদ্মা', 0.7413997650146484),
 ('পদ্মাসেতুর', 0.716225266456604),
 ('পদ্ম', 0.7154797315597534),
 ('পদ্মহেম', 0.6881639361381531),
 ('পদ্মাবত', 0.6682782173156738)]
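
Most of the neighbors listed above share part of their spelling with the query word. That fits how fastText-style models represent a word: as a bag of character n-grams taken from the word wrapped in < and > boundary markers, so words with overlapping n-grams end up with related vectors. The sketch below only illustrates the n-gram extraction. The function name char_ngrams is illustrative, and the 3 to 6 range is a commonly used default rather than something this document specifies.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of '<word>'."""
    wrapped = f"<{word}>"
    # Slicing works on Unicode code points, so Bengali vowel signs and
    # the hasanta each count as one character here.
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


shared = char_ngrams("পদ্মা") & char_ngrams("পদ্মার")
print(sorted(shared))
```

The shared n-grams are also why such models can produce a vector for a word that never appeared in the training corpus.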

Get Odd Words

Finds the word that least matches the others in the given word list.

from ekushey.feature_extraction import BN_FastText

ft = BN_FastText(model_name='give the model name here')
ft.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])

Output:

'আকাশ' 

Get Similarity Plot

Creates a bar plot of similar words with their probabilities.

  • GloVe

    • Training
    • Get n Similar Words

Training

from ekushey.feature_extraction import BN_GloVe

# Training against sentences
glv = BN_GloVe(sentences=[
    ['আমার', 'প্রিয়', 'জন্মভূমি'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
    ['বাংলা', 'আমার', 'মাতৃভাষা'],
])
glv.train()

# Training against one text corpus
glv = BN_GloVe(corpus_file="path_to_corpus.txt")
glv.train()

# Training against multiple corpora
'''
    path
      ->Corpus
        ->1.txt
        ->2.txt
        ->3.txt
'''
glv = BN_GloVe(corpus_path="path/Corpus")
glv.train(epochs=25)

# Training against a DataFrame column
# (news_data is assumed to be an existing pandas DataFrame)
glv = BN_GloVe(df=news_data['text_content'])
glv.train(epochs=25)

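After training is done, the model "glove_model", along with its supporting vector files, will be saved to the current directory.

If you don't want to train and would rather use a pretrained model, specify it when initializing BN_GloVe(). Otherwise no model_name is needed.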

Get n Similar Words

from ekushey.feature_extraction import BN_GloVe

glv = BN_GloVe(model_name='give the model name here')
glv.get_n_similarWord(['পদ্মা'], n=10)

Output:

[('পদ্মায়', 0.8103810548782349),
 ('পদ্মার', 0.794012725353241),
 ('পদ্মানদীর', 0.7747839689254761),
 ('পদ্মা-মেঘনার', 0.7573559284210205),
 ('পদ্মা.', 0.7470568418502808),
 ('‘পদ্মা', 0.7413997650146484),
 ('পদ্মাসেতুর', 0.716225266456604),
 ('পদ্ম', 0.7154797315597534),
 ('পদ্মহেম', 0.6881639361381531),
 ('পদ্মাবত', 0.6682782173156738)]
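
Unlike Word2Vec and FastText, GloVe is fitted to word co-occurrence statistics gathered over the whole corpus. The sketch below shows only that counting step, with each pair weighted by 1/distance inside a context window. It is an illustration, not Ekushey's code: the helper name cooccurrence and the window size of 2 are assumptions.

```python
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Count symmetric co-occurrences, weighting each pair by 1/distance."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions from the current word.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


sentences = [["আমার", "প্রিয়", "জন্মভূমি"], ["বাংলা", "আমার", "মাতৃভাষা"]]
counts = cooccurrence(sentences)
first, second = sentences[0][:2]
print(counts[(first, second)])
```

Adjacent words contribute a full count and words two positions apart contribute half, so nearby context influences the resulting vectors more strongly.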
