Training the NLTK Brill tagger with a txt file as input

Question

Hello everyone. I am currently working on my final-year project, titled "Malay Part-of-Speech Tagger Using the Brill Tagger".

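How can I train on tagged sentences that I have saved in a txt file? The input should be a txt file, which is then used to train the Brill tagger. Afterwards, I will use another txt file as test data. However, I am stuck on the training part. Can you help me?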

Here is some code I have written.

import nltk  
f = open('gayahidupsihat_tagged.txt')  
malay_tagged = f.read()   

def train_brill_tagger(train_data):
    # Modules for creating the templates.
    from nltk.tag import UnigramTagger
    from nltk.tag.brill import SymmetricProximateTokensTemplate, ProximateTokensTemplate
    from nltk.tag.brill import ProximateTagsRule, ProximateWordsRule
    # The brill tagger module in NLTK.
    from nltk.tag.brill import FastBrillTaggerTrainer
    unigram_tagger = UnigramTagger(train_data)
    templates = [SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
                 ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
                 ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1))]

    trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger,
                                   templates=templates, trace=3,
                                   deterministic=True)
    brill_tagger = trainer.train(train_data, max_rules=10)
    print
    return brill_tagger    

malay_train = (malay_tagged[:10]) 
malay_test = (malay_tagged[10:15]) 
malay20 = malay_tagged[20]

mt = train_brill_tagger(malay_train)    
print mt.tag(malay20)
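
One thing worth noting about the slices in the code above: `f.read()` returns the whole file as a single string, so `malay_tagged[:10]` is the first ten characters rather than ten sentences, whereas `UnigramTagger` expects a list of sentences, each a list of `(word, tag)` tuples. A small illustration in Python 3, where `sample` is a shortened stand-in for the file contents (an assumption made for the example, not the real file):

```python
# Stand-in for f.read(): the file contents arrive as one long string.
sample = r"Gaya\NN hidup\NN sihat\VB boleh\MD"

# Slicing a string yields characters, not sentences.
first_ten = sample[:10]
print(repr(first_ten))

# Indexing a string yields a single character, not a sentence.
print(repr(sample[20]))
```

This suggests the text has to be split into tagged sentences before it is sliced into training and test portions.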

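What I actually want is to train on one tagged paragraph and then test with another paragraph. Finally, I will use tagged sentences to evaluate how well the Brill tagger performs.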

For example:

I train on this file (gayahidupsihat_train.txt); the input is actually a single line:

Gaya\NN hidup\NN sihat\VB boleh\MD lah\UH ditakrifkan\VBZ sebagai\DT
satu\CD amalan\VBZ kehidupan\NN yang\DT membawa\VBZ impak\NN positif\NN
kepada\TO diri\NN seseorang\NN ,\, keluarganya\NN dan\CC masyarakat\NN.
Antara\IN contoh\NN kehidupan\NN yang\DT sihat\VB ialah\DT individu\NN
tersebut\EX hidup\VB dengan\DT penuh\RB ceria\RB tanpa\NN mengalami\VBZ
sebarang\NN masalah\NN yang\DT boleh\MD menjejaskan\VBZ kehidupannya\NN
untuk\TO satu\CD tempoh\NN tertentu\EX pula\DT .\. Sudah\EX pasti\RB
dalam\DT kehidupan\NN era\NN moden\NN yang\DT begitu\DT banyak\RB
tekanan\VB ini\DT gaya\NN hidup\NN sihat\VB menjadi\VBZ satu\NUM
matlamat\NN yang\DT perlu\MD dicapai\VBZ segera\VB. Oleh\PDT itu\DT ,\,
terdapat\EX pelbagai\NN tindakan\VBZ yang\DT boleh\MD dilakukan\VBZ
untuk\TO mencapai\VBZ matlamat\NN ini\DT .\.
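
For reference, text in this `word\TAG` format could be converted into a list of tagged sentences roughly as follows. This is a minimal Python 3 sketch under some assumptions: the helper name `parse_tagged_text` is hypothetical, tokens are separated by whitespace, and a token tagged `.` is taken to end a sentence. The `sample` string is a shortened excerpt, not the full file.

```python
def parse_tagged_text(text, sep="\\", sentence_end_tags=(".",)):
    """Turn 'word\\TAG word\\TAG ...' text into a list of tagged sentences."""
    sentences, current = [], []
    for token in text.split():
        # Split on the last separator so the word itself may contain one.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator present: keep the token and mark its tag as unknown.
            word, tag = token, None
        current.append((word, tag))
        if tag in sentence_end_tags:
            sentences.append(current)
            current = []
    if current:
        # Keep a trailing sentence that has no end-of-sentence token.
        sentences.append(current)
    return sentences

sample = r"Gaya\NN hidup\NN sihat\VB .\. Oleh\PDT itu\DT ,\, terdapat\EX .\."
train_sents = parse_tagged_text(sample)
print(len(train_sents))
print(train_sents[0])
```

Tokens such as `masyarakat\NN.` in the text above have the full stop fused into the tag, so they would come out with the tag `NN.` and would not end a sentence; the file would need cleaning first. NLTK's `nltk.tag.str2tuple` accepts a `sep` argument and should perform a similar per-token split.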

Then I want to test with this file (gayahidupsihat_test.txt):

Tindakan\VBP awal\VB ialah\DT seseorang\NN itu\DT perlu\MD
mengamalkan\VBD satu\CD bentuk\NN pemakanan\NN yang\DT seimbang\NN
dalam\IN kehidupannya\VBZ .\.Dalam\IN keadaan\NN kehidupan\NN sebenar\JJ
,\, orang\NN ramai\JJ lebih\JJR suka\VB mengambil\VBZ makanan\NN yang\DT
bersifat\VBZ mudah\JJ seperti\DT mengamalkan\VBZ pengambilan\VBD makanan\NN
ringan\JJ ataupun\CC makanan\NN segera\NN .\. TidaK\DT kurang\JJR juga\DT
masyarakat\NN kita\PRP hari\NN ini\DT yang\DT lupa\VB kesan\NN pengambilan\VBZ
makanan\NN berlemak\JJR ataupun\CC makanan\NN yang\DT mempunyai\VBZ
kandungan\NN garam\NN ,\. gula\NN atau\DT sodium\FW glutamit\FW yang\DT
tinggi\JJ .\. Hal\IN ini\DT boleh\MD mendatangkan\VBZ pelbagai\NN penyakit\NN
kronik\JJ seperti\DT sakit\JJ jantung\NN ,\, darah\NN tinggi\JJ
ataupun\CC kencing\NN manis\JJ yang\DT juga\DT menjadi\MD punca\NN kematian\NN
tertinggi\JJS di\IN negara\NN kita\PRP .\.

After that, I will try the tagger on some tagged_words and evaluate it.
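
To make the evaluation step concrete, token-level accuracy could be computed along these lines. This is only a Python 3 sketch: the function name `tagging_accuracy` is hypothetical, and the `predicted` tags are made up for illustration rather than real tagger output.

```python
def tagging_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            total += 1
            correct += gold_tag == pred_tag
    return correct / total if total else 0.0

# Gold tags taken from the start of the test text.
gold = [[("Tindakan", "VBP"), ("awal", "VB"), ("ialah", "DT"), (".", ".")]]
# Hypothetical tagger output for the same words (made up for illustration).
predicted = [[("Tindakan", "NN"), ("awal", "VB"), ("ialah", "DT"), (".", ".")]]

print(tagging_accuracy(gold, predicted))  # 0.75
```

NLTK taggers also expose a built-in method for this (`evaluate` in older releases), which takes the gold tagged sentences directly.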

The output for the English version looks like this:

Training Brill tagger on 500 sentences...
Finding initial useful rules...
Found 10210 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  46  46   0   0  | TO -> IN if the tag of the following word is 'AT'
  18  20   2   0  | TO -> IN if the tag of words i+1...i+3 is 'CD'
  14  14   0   0  | IN -> IN-TL if the tag of the preceding word is
                  |   'NN-TL', and the tag of the following word is
                  |   'NN-TL'
  11  11   0   1  | TO -> IN if the tag of the following word is 'NNS'
  10  10   0   0  | TO -> IN if the tag of the following word is 'JJ'
   8   8   0   0  | , -> ,-HL if the tag of the preceding word is 'NP-
                  |   HL'
   7   7   0   1  | NN -> VB if the tag of the preceding word is 'MD'
   7  13   6   0  | NN -> VB if the tag of the preceding word is 'TO'
   7   7   0   0  | NP-TL -> NP if the tag of words i+1...i+2 is 'NNS'
   7   7   0   0  | VBN -> VBD if the tag of the preceding word is
                  |   'NP'
