Poor results from NLTK POS tagging for Spanish?

Question

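I am trying to build a performance comparison of POS taggers for Spanish. My current script is a modified version of this script, though I also tried another version and got very similar results.

I am using the cess_esp corpus and have created Unigram, Bigram, Trigram and Brill taggers from it, training each one on the corpus's tagged sentences.

I am worried about how the bigram and trigram taggers perform... judging by the results, they do not seem to work at all.

For example, here is some of the output from my script: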

*************** START TAGGING FOR LINE 6 ****************************************************************************************************************************************

Current line contents before tagging-> mejor ve a la sucursal de Juan Pablo II es la que menos gente tiene y no te tardas nada

Unigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', 'aq0cs0'), ('ve', 'vmip3s0'), ('a', 'sps00'), ('la', 'da0fs0'), ('sucursal', 'ncfs000'), ('de', 'sps00'), ('Juan', 'np0000p'), ('Pablo', None), ('II', None), ('es', 'vsip3s0'), ('la', 'da0fs0'), ('que', 'pr0cn000'), ('menos', 'rg'), ('gente', 'ncfs000'), ('tiene', 'vmip3s0'), ('y', 'cc'), ('no', 'rn'), ('te', 'pp2cs000'), ('tardas', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]

Trigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]
****************************************************************************************************************************************

*************** START TAGGING FOR LINE 7 ****************************************************************************************************************************************

Current line contents before tagging-> He levantado ya varios reporte pero no resuelven nada

Unigram tagger-> [('He', 'vaip1s0'), ('levantado', 'vmp00sm'), ('ya', 'rg'), ('varios', 'di0mp0'), ('reporte', 'vmsp1s0'), ('pero', 'cc'), ('no', 'rn'), ('resuelven', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

Trigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

*************** START TAGGING FOR LINE 8 ****************************************************************************************************************************************

Current line contents before tagging-> Es lamentable el servicio que brindan

Unigram tagger-> [('@ContactoBanamex', None), ('Es', 'vsip3s0'), ('lamentable', 'aq0cs0'), ('el', 'da0ms0'), ('servicio', 'ncms000'), ('que', 'pr0cn000'), ('brindan', None)]

Bigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

Trigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

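The bigram and trigram taggers are trained as shown in the linked script, which, incidentally, is also the most straightforward approach described in the NLTK book: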

from nltk.corpus import cess_esp as cess
from nltk import BigramTagger as bt
from nltk import TrigramTagger as tt

# Tagged sentences from the cess_esp corpus, used as training data.
cess_sents = cess.tagged_sents()
# Train the BigramTagger (no backoff tagger).
bi_tag = bt(cess_sents)
# Train the TrigramTagger (no backoff tagger).
tri_tag = tt(cess_sents)

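Do you think I am missing something? Shouldn't the bigram and trigram taggers perform better than the unigram tagger? Should I always use a backoff tagger with the bigram and trigram taggers?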
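
To make the backoff part of the question concrete, here is a minimal sketch of the chaining I have in mind. If I read the NLTK book correctly, an n-gram tagger with no backoff returns None for any context it never saw in training, and because the previous tag is then None, every following context is unseen as well. The tiny training set and the default tag "ncms000" below are assumptions made only for illustration (so the sketch runs without downloading cess_esp); they are not taken from my real script.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# Toy training data (an assumption for illustration, NOT the cess_esp corpus).
train = [
    [("el", "da0ms0"), ("servicio", "ncms000"),
     ("es", "vsip3s0"), ("lamentable", "aq0cs0")],
    [("la", "da0fs0"), ("sucursal", "ncfs000"),
     ("es", "vsip3s0"), ("mejor", "aq0cs0")],
]

# Bigram tagger on its own, trained the same way as in the question.
bigram_alone = BigramTagger(train)

# Backoff chain: trigram -> bigram -> unigram -> default.
# The default tag is a hypothetical choice (tag unknown words as a noun).
default = DefaultTagger("ncms000")
unigram = UnigramTagger(train, backoff=default)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)

# The first token never appears in the training data.
sentence = ["@ContactoBanamex", "el", "servicio", "es", "mejor"]

alone = bigram_alone.tag(sentence)
chained = trigram.tag(sentence)

print("Bigram alone ->", alone)
print("With backoff ->", chained)
```

In this sketch the standalone bigram tagger fails on the unseen first token and then tags nothing after it, which looks like what my output shows, while the chained tagger falls back to the unigram or default tagger for each word.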

Thanks!
Alejandro

