HMM tagging in NLTK is inaccurate

Posted 2024-05-16 07:18:30


I have been trying to implement a simple POS tagger using an HMM and came up with the following code.

import nltk
from nltk.corpus import treebank

train_data = treebank.tagged_sents()[:3000]

print(train_data[0])
# [(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), ... ]

from nltk.tag import hmm

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)

print(tagger)

print(tagger.tag("Alex was born in Connecticut .".split()))
# [('Alex', u'NNP'), ('was', u'NNP'), ('born', u'NNP'), ('in', u'NNP'), ('Connecticut', u'NNP'), ('.', u'NNP')]

print(tagger.tag("Joe met Joanne in Delhi .".split()))
# [('Joe', u'NNP'), ('met', u'VBD'), ('Joanne', u'NNP'), ('in', u'IN'), ('Delhi', u'NNP'), ('.', u'NNP')]

print(tagger.tag("Chicago is the birthplace of Ginny".split()))
# [('Chicago', u'NNP'), ('is', u'VBZ'), ('the', u'DT'), ('birthplace', u'NNP'), ('of', u'NNP'), ('Ginny', u'NNP')]

As you can see, many of the tags are simply wrong. Why is that? I thought the training set was large enough :|
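One possible cause, offered as a guess rather than a confirmed diagnosis: when no `estimator` argument is passed, `train_supervised` appears to fall back to an unsmoothed maximum-likelihood estimate, so a word that never occurs in the training sentences gets emission probability zero under every tag. The toy sketch below does not use NLTK. The helper names `mle_prob` and `lidstone_prob`, the tiny counts, the vocabulary size, and gamma = 0.1 are all made up for illustration.

```python
from collections import Counter

# Toy emission counts for a single tag (hypothetical numbers, not taken from treebank).
nnp_counts = Counter({"Pierre": 3, "Vinken": 2, "Chicago": 1})
vocab_size = 1000  # assumed number of distinct word types ("bins")


def mle_prob(counts, word):
    """Unsmoothed maximum-likelihood estimate: count / total."""
    total = sum(counts.values())
    return counts[word] / total


def lidstone_prob(counts, word, gamma, bins):
    """Add-gamma (Lidstone) estimate: (count + gamma) / (total + gamma * bins)."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * bins)


print(mle_prob(nnp_counts, "Pierre"))                       # seen word: non-zero
print(mle_prob(nnp_counts, "Alex"))                         # unseen word: zero
print(lidstone_prob(nnp_counts, "Alex", 0.1, vocab_size))   # unseen word: small but non-zero
```

If that guess is right, one thing to try would be passing a smoothed estimator to the trainer, for example `estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)` with `LidstoneProbDist` imported from `nltk.probability`. This is untested against the sentences above.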

Also, when I run tagger.evaluate(treebank.tagged_sents()[3000:]), I get a score of only 0.3 against the gold standard.
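For reference, the score reported by `evaluate` is a token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag. A minimal sketch of that computation is shown here, assuming plain Python lists of `(word, tag)` pairs. The function name `token_accuracy` is hypothetical, and the gold tags in the example are hand-written for illustration, not taken from treebank.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted (word, tag) pair equals the gold pair."""
    gold = [pair for sent in gold_sents for pair in sent]
    pred = [pair for sent in predicted_sents for pair in sent]
    if len(gold) != len(pred):
        raise ValueError("gold and predicted data must contain the same number of tokens")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, pred))
    return matches / len(gold)


# Hand-written gold tags (an assumption) versus the tagger output quoted above.
gold = [[("Joe", "NNP"), ("met", "VBD"), ("Joanne", "NNP"),
         ("in", "IN"), ("Delhi", "NNP"), (".", ".")]]
pred = [[("Joe", "NNP"), ("met", "VBD"), ("Joanne", "NNP"),
         ("in", "IN"), ("Delhi", "NNP"), (".", "NNP")]]

print(token_accuracy(gold, pred))  # 5 of 6 tokens match
```

Read this way, a score of 0.3 means roughly 30% of the held-out tokens received the gold tag.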

Also posted here.

