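Morphological analyzer (word segmentor + PoS tagger) for Chinese and Japanese
Project description for the rakutenma Python package
Rakuten MA Python
Rakuten MA Python (morphological analyzer) is a Python version of Rakuten MA (word segmentor + PoS tagger) for Chinese and Japanese.
For details about Rakuten MA, see https://github.com/rakuten-nlp/rakutenma
See also http://qiita.com/yukinoi/items/925bc238185aa2fad8a7 (in Japanese)
Contributions are welcome!
Installation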
pip install rakutenma
Example
import json

from rakutenma import RakutenMA

# Initialize a RakutenMA instance with an empty model
# the default ja feature set is set already
rma = RakutenMA()

# Let's analyze a sample sentence (from http://tatoeba.org/jpn/sentences/show/103809)
# With a disastrous result, since the model is empty!
print(rma.tokenize("彼は新しい仕事できっと成功するだろう。"))

# Feed the model with ten sample sentences from tatoeba.com
# "tatoeba.json" is available at https://github.com/rakuten-nlp/rakutenma
with open("tatoeba.json") as f:
    tatoeba = json.load(f)
for i in tatoeba:
    rma.train_one(i)

# Now what does the result look like?
print(rma.tokenize("彼は新しい仕事できっと成功するだろう。"))

# Initialize a RakutenMA instance with a pre-trained model
rma = RakutenMA(phi=1024, c=0.007812)  # Specify hyperparameter for SCW (for demonstration purpose)
rma.load("model_ja.json")

# Set the feature hash function (15bit)
rma.hash_func = rma.create_hash_func(15)

# Tokenize one sample sentence
print(rma.tokenize("うらにわにはにわにわとりがいる"))

# Re-train the model feeding the right answer (pairs of [token, PoS tag])
res = rma.train_one(
    [["うらにわ", "N-nc"],
     ["に", "P-k"],
     ["は", "P-rj"],
     ["にわ", "N-n"],
     ["にわとり", "N-nc"],
     ["が", "P-k"],
     ["いる", "V-c"]])
# The result of train_one contains:
#   sys: the system output (using the current model)
#   ans: answer fed by the user
#   update: whether the model was updated
print(res)

# Now what does the result look like?
print(rma.tokenize("うらにわにはにわにわとりがいる"))
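The "15bit" hash function in the example refers to feature hashing: each feature is mapped to one of 2^15 = 32768 buckets, which keeps the model small. The sketch below illustrates the general idea in Python 3 with a 32-bit FNV-1a hash truncated to 15 bits. The names `fnv1a_32` and `hash_feature`, and the choice of joining feature parts with a tab, are assumptions made for illustration; this is not necessarily how `create_hash_func` works internally.

```python
def fnv1a_32(data: bytes) -> int:
    """Plain 32-bit FNV-1a hash of a byte string."""
    h = 0x811C9DC5                         # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return h


def hash_feature(feature, bits=15):
    """Map a feature (a list of strings) to a bucket index in [0, 2**bits).

    Hypothetical helper for illustration only, not part of rakutenma.
    """
    key = "\t".join(feature).encode("utf-8")
    return fnv1a_32(key) & ((1 << bits) - 1)  # keep the lowest `bits` bits


# A made-up character feature; the real feature layout may differ.
index = hash_feature(["c0", "に"])
print(index)
```

Fewer bits give a smaller model at the cost of more collisions between unrelated features.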
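Note
Added API
Compared with the original Rakuten MA, the following methods have been added:
- RakutenMA::load(model_path) - Load a model from a JSON file
- RakutenMA::save(model_path) - Save the model to a path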
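A common use of these two methods is to load a bundled model, feed it corrected sentences, and write the result to a new file. The sketch below wraps the training and saving steps in a small helper; it relies only on `train_one` and `save` as described in this document. The helper name `train_and_save` and the output file name are hypothetical.

```python
def train_and_save(rma, sentences, model_path):
    """Feed each annotated sentence to `rma`, then save the model.

    `rma` is expected to provide train_one() and save(), as RakutenMA does
    according to this README. Each sentence is a list of [token, PoS tag]
    pairs. Returns the number of sentences fed to the model.
    """
    count = 0
    for sentence in sentences:
        rma.train_one(sentence)
        count += 1
    rma.save(model_path)
    return count


# Usage with the real library (requires `pip install rakutenma`):
#   from rakutenma import RakutenMA
#   rma = RakutenMA()
#   rma.load("model_ja.json")
#   train_and_save(rma, [[["に", "P-k"], ["は", "P-rj"]]], "model_ja_custom.json")
```

Saving to a separate path leaves the original model file untouched, so the retrained model can be compared against it later.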
Misc
The following values are set by default:
- rma.featset = CTYPE_JA_PATTERNS  # RakutenMA.default_featset_JA
- rma.hash_func = rma.create_hash_func(15)
- rma.tag_scheme = "SBIEO"  # if using Chinese, set "IOB2"
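The two tag schemes differ in how each character of a token is labeled. In the conventional reading, SBIEO marks single-character tokens (S) and the beginning, inside and end of longer tokens (B, I, E), while IOB2 only distinguishes the first character (B) from the rest (I). The sketch below shows this on two tokens from the example sentence. The helper `label_chars` and the `prefix-PoS` label format are assumptions made for illustration, not the library's API.

```python
def label_chars(tokens, scheme="SBIEO"):
    """Return (character, label) pairs for a list of [surface, PoS] tokens.

    Hypothetical helper for illustration only, not part of rakutenma.
    """
    labeled = []
    for surface, pos in tokens:
        last = len(surface) - 1
        for i, ch in enumerate(surface):
            if scheme == "SBIEO":
                if last == 0:
                    prefix = "S"      # single-character token
                elif i == 0:
                    prefix = "B"      # first character
                elif i == last:
                    prefix = "E"      # last character
                else:
                    prefix = "I"      # inside the token
            elif scheme == "IOB2":
                prefix = "B" if i == 0 else "I"
            else:
                raise ValueError("unknown tag scheme: %s" % scheme)
            labeled.append((ch, prefix + "-" + pos))
    return labeled


tokens = [["にわとり", "N-nc"], ["が", "P-k"]]
sbieo = label_chars(tokens, "SBIEO")
iob2 = label_chars(tokens, "IOB2")
print(sbieo)
print(iob2)
```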
License
Apache License Version 2.0
Copyright
Rakuten MA Python (c) 2015- Yukino Ikegami. All Rights Reserved.
Rakuten MA (original) (c) 2014 Rakuten NLP Project. All Rights Reserved.
CHANGES
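0.3.3 (2017-05-22)
- Fixed a bug in training
0.3.2 (2017-02-01)
- Use ujson when available
- Enable conversion of PoS tags to MeCab style
- Support Python 3.5 and 3.6
0.3 (2016-04-10)
- Add CUI (rakutenma)
0.2.2 (2016-04-09)
- Bundle model files (model_ja.json, model_ja_min.json)
- Support Windows
0.2 (2015-01-10)
- Support Python 2.6 and 2.7
0.1.1 (2015-01-08)
- Slightly improved performance
0.1 (2015-01-01)
- First release.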