I have been trying to implement a simple POS tagger using an HMM and came up with the following code.
import nltk
from nltk.corpus import treebank
# Train on the first 3000 tagged sentences of the treebank sample
train_data = treebank.tagged_sents()[:3000]
print(train_data[0])
# [(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), ... ]
from nltk.tag import hmm
# Supervised HMM training with the trainer's default settings
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
print(tagger)
# Tag a few sentences that are not in the training data
print(tagger.tag("Alex was born in Connecticut .".split()))
# [('Alex', u'NNP'), ('was', u'NNP'), ('born', u'NNP'), ('in', u'NNP'), ('Connecticut', u'NNP'), ('.', u'NNP')]
print(tagger.tag("Joe met Joanne in Delhi .".split()))
# [('Joe', u'NNP'), ('met', u'VBD'), ('Joanne', u'NNP'), ('in', u'IN'), ('Delhi', u'NNP'), ('.', u'NNP')]
print(tagger.tag("Chicago is the birthplace of Ginny".split()))
# [('Chicago', u'NNP'), ('is', u'VBZ'), ('the', u'DT'), ('birthplace', u'NNP'), ('of', u'NNP'), ('Ginny', u'NNP')]
As you can see, (many of) the tags are way off. Why is this happening? I thought the training set was large enough :| ...?
Also, when I run tagger.evaluate(treebank.tagged_sents()[3000:]), I get a score of only about 0.3 against the gold standard.
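One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the emission probabilities are plain maximum-likelihood estimates, any word that never appeared in the training data gets probability zero under every tag, so the tagger has nothing to rank tags by from that word onward. The sketch below uses only the standard library and made-up toy counts (the `emissions` table, `mle_prob`, and `lidstone_prob` are hypothetical names, not NLTK code) to show the difference between an unsmoothed estimate and a Lidstone-smoothed one.

```python
from __future__ import division, print_function
from collections import Counter

def mle_prob(counts, word):
    """Maximum-likelihood estimate: count / total, so unseen words get 0.0."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def lidstone_prob(counts, word, gamma, bins):
    """Lidstone estimate: (count + gamma) / (total + gamma * bins)."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * bins)

# Toy emission counts per tag (hypothetical data, not taken from the treebank).
emissions = {
    "NNP": Counter({"Joe": 3, "Chicago": 2}),
    "VBD": Counter({"was": 4, "met": 1}),
    "IN": Counter({"in": 5}),
}
vocab = {w for counts in emissions.values() for w in counts}
bins = len(vocab) + 1  # one extra bin reserved for unseen words (an assumption)

for word in ("was", "Alex"):  # "was" is seen in the toy data, "Alex" is not
    mle = {tag: mle_prob(c, word) for tag, c in emissions.items()}
    smoothed = {tag: lidstone_prob(c, word, 0.1, bins) for tag, c in emissions.items()}
    print(word, "MLE:", mle)
    print(word, "Lidstone:", smoothed)
```

With NLTK itself, `train_supervised` accepts an optional `estimator` argument, a callable that takes a frequency distribution and a bin count. Passing something like `lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)`, with `LidstoneProbDist` imported from `nltk.probability`, is one way to try smoothing; the value 0.1 is an arbitrary starting point, not a recommended setting.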
Also posted here: