Tagging Spanish words using a corpus

3 answers

User

#1 · edited 2024-05-20 04:37:38

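First, you need to read the tagged sentences from a corpus. NLTK provides a nice interface, so you don't have to worry about the different formats of different corpora; you simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml.

Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.

Then you can use the tagger to tag the sentence you want. Here is some example code: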

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list, 
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()

# Train the unigram tagger
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# Tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))

# Split corpus into training and testing set.
train = int(len(cess_sents)*90/100) # 90%

# Train a bigram tagger with only training data.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10% held out as test data.
# (Recent NLTK releases also provide accuracy() for this.)
bi_tag.evaluate(cess_sents[train:])

# Using the tagger.
bi_tag.tag(sentence.split(" "))
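A bigram tagger trained on its own tends to return None for any context it has not seen in training. One common remedy is to chain taggers through the `backoff` parameter. The following is a minimal sketch, assuming Python 3 and a recent NLTK; the toy sentences and simplified tags are made up for illustration, and in practice you would pass `cess.tagged_sents()` instead.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Toy training data (hypothetical, simplified tags); replace with
# cess.tagged_sents() to train on the real corpus.
toy_train = [
    [("el", "da"), ("gato", "nc"), ("come", "vm"), (".", "Fp")],
    [("la", "da"), ("casa", "nc"), ("es", "vs"), ("grande", "aq"), (".", "Fp")],
]

# Last resort: call every unknown word a common noun (an assumption
# chosen for this sketch, not a recommendation for real data).
default_tag = DefaultTagger("nc")

# Each tagger hands over to its backoff when it has no answer.
uni_backoff = UnigramTagger(toy_train, backoff=default_tag)
bi_backoff = BigramTagger(toy_train, backoff=uni_backoff)

# "perro" never appears in the training data, so it falls through
# to the default tagger.
tagged = bi_backoff.tag("el perro come .".split())
print(tagged)
```

With this chain, no token is left untagged, although the default guess can of course be wrong.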

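Training a tagger on a large corpus can take a significant amount of time. Instead of training a tagger every time you need one, it is convenient to save a trained tagger to a file for later reuse.

See the section on storing taggers in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html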
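Saving and reloading a trained tagger could look roughly like this. It is a sketch assuming Python 3 and the standard `pickle` module; the toy training data and the file name `toy_unigram.tagger` are placeholders, not part of the original answer.

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Toy training data (hypothetical); use cess.tagged_sents() in practice.
toy_train = [[("hola", "i"), ("mundo", "nc"), (".", "Fp")]]
tagger = UnigramTagger(toy_train)

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_unigram.tagger")

# Save the trained tagger in binary mode.
with open(path, "wb") as outfile:
    pickle.dump(tagger, outfile, protocol=pickle.HIGHEST_PROTOCOL)

# Later (for example in another script), load it without retraining.
with open(path, "rb") as infile:
    restored = pickle.load(infile)

tokens = "hola mundo .".split()
print(restored.tag(tokens))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.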

User
#2 · edited 2024-05-20 04:37:38

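The script below gives you a quick way to get a "bag of words" from Spanish sentences. Note that to do this properly you must tokenize the sentences before tagging, so "religiosas." must be separated into two tokens, "religiosas" and ".".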
#-*- coding: utf8 -*-

# about the tagger: http://nlp.stanford.edu/software/tagger.shtml
# about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

import nltk
from nltk.tag.stanford import POSTagger

spanish_postagger = POSTagger('models/spanish.tagger',
                              'stanford-postagger.jar', encoding='utf8')

sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.',
             'Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']

# isNoun was not defined in the original snippet. Assumption, based on
# the output below: noun tags in this tagset start with "n".
def isNoun(tag):
    return tag.startswith('n')

for sent in sentences:
    # Naive whitespace split; punctuation stays attached to the words.
    words = sent.split()
    tagged_words = spanish_postagger.tag(words)

    nouns = []
    for (word, tag) in tagged_words:
        print((word + ' ' + tag).encode('utf8'))
        if isNoun(tag):
            nouns.append(word)

    print(nouns)
This gives:
El da0000
copal nc0s000
se p0000000
usa vmip000
principalmente rg
para sp000
sahumar vmn0000
en sp000
distintas di0000
ocasiones nc0p000
como cs
lo pp000000
son vsip000
las da0000
fiestas nc0p000
religiosas. np00000
[u'copal', u'ocasiones', u'fiestas', u'religiosas.']
Las da0000
flores, np00000
hojas nc0p000
y cc
frutos nc0p000
se p0000000
usan vmip000
para sp000
aliviar vmn0000
la da0000
tos nc0s000
y cc
también rg
se p0000000
emplea vmip000
como cs
sedante. nc0s000
[u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']
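The output shows why tokenizing first matters: "religiosas." and "flores," keep their punctuation and end up tagged as proper nouns. A minimal way to split punctuation off before tagging, using only the standard `re` module, might look like the sketch below. It assumes Python 3; the helper names `tokenize` and `is_noun` are hypothetical, and a dedicated Spanish tokenizer would handle more cases (abbreviations, numbers with decimals, and so on).

```python
import re

def tokenize(sentence):
    # \w+ matches a run of letters/digits (Unicode-aware in Python 3,
    # so accented characters such as "é" are kept inside the word);
    # [^\w\s] matches one punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

def is_noun(tag):
    # Assumption taken from the output above: noun tags start with "n"
    # (for example nc0s000 or np00000).
    return tag.lower().startswith("n")

tokens = tokenize("Las flores, hojas y frutos se usan para aliviar la tos.")
print(tokens)

# Filtering nouns from already tagged pairs (pairs copied by hand from
# the output above, for illustration only).
tagged_words = [("copal", "nc0s000"), ("se", "p0000000"), ("usa", "vmip000")]
nouns = [word for word, tag in tagged_words if is_noun(tag)]
print(nouns)
```

The resulting token list can be passed to the tagger's `tag` method in place of `sent.split()`.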

User
#3 · edited 2024-05-20 04:37:38

Following the tutorial in the previous answer, here is a more object-oriented approach from spaghetti tagger: https://github.com/alvations/spaghetti-tagger

#-*- coding: utf8 -*-

from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
from cPickle import dump,load

def loadtagger(taggerfilename):
    infile = open(taggerfilename,'rb')
    tagger = load(infile); infile.close()
    return tagger

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename,tagger):
        outfile = open(tagfilename, 'wb')
        dump(tagger,outfile,-1); outfile.close()
        return
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname+'_unigram.tagger',uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname+'_bigram.tagger',bi_tag)
    print "Tagger trained with",corpusname,"using " +\
                "UnigramTagger and BigramTagger."
    return

# Function to unchunk corpus.
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_"," ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self,mwe=True):
        self.mwe = mwe
        # Train tagger if it's used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print "*** First-time use of cess tagger ***"
            print "Training tagger ..."
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess',cess_sents)
            # Trains the tagger with no MWE.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe',tagged_cess_nomwe)
            print
        # Load tagger.
        if self.mwe == True:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        elif self.mwe == False:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.batch_tag(sentences)

tagger = cesstag()
print tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split())
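The code above targets Python 2 and an older NLTK. In NLTK 3, as far as I know, the batch method is called `tag_sents` rather than `batch_tag`. A small sketch of tagging several sentences at once, assuming Python 3 and using made-up toy training data with simplified tags:

```python
from nltk.tag import UnigramTagger

# Toy training data (hypothetical); a real run would use
# cess.tagged_sents() or a tagger loaded from disk.
toy_train = [
    [("mi", "dp"), ("colega", "nc"), ("me", "pp"), ("ayuda", "vm"), (".", "Fp")],
]
tagger = UnigramTagger(toy_train)

# tag_sents takes a list of token lists and returns one tagged
# sentence per input sentence.
sentences = [["mi", "colega"], ["me", "ayuda", "."]]
tagged_sents = tagger.tag_sents(sentences)

for tagged in tagged_sents:
    print(tagged)
```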
