How do I tag a text file with hunpos in nltk?
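Can anyone help me with the syntax for tagging a corpus with hunpos in nltk?
What do I need to import in order to use the
hunpos.HunPosTagger
module? And how do I HunPos-tag a corpus? Please see the code below.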
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader
corpus_root = './'
reader = PlaintextCorpusReader (corpus_root, '.*')
ntuen = LazyCorpusLoader ('ntumultien', PlaintextCorpusReader, reader)
ntuen.fileids()
isinstance (ntuen, PlaintextCorpusReader)
# So how do I hunpos tag `ntuen`? I can't get the following code to work.
# please help me to correct my python syntax errors, I'm new to python
# but i really need this to work. sorry
##from nltk.tag import hunpos.HunPosTagger
ht = HunPosTagger('english.model')
for sentence in ntu.sent() ##looping through the no. of sentence
ht.tag(ntusent()[i])
1 Answer
import nltk
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize
corpus = "so how do i hunpos tag my ntuen ? i can't get the following code to work."
# Note that the class name is HunposTagger, not HunPosTagger.
# 'en_wsj.model' is the English model file; point this at wherever yours is saved.
ht = HunposTagger('en_wsj.model')
# tag() expects a list of tokens, so tokenize the string first.
print(ht.tag(word_tokenize(corpus)))
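I think the main problem is that you did not tokenize the text into words, although there may be other reasons your code does not work (for one, the class is HunposTagger, not HunPosTagger). I put together a simplified example based on your question. If you have any further questions, please leave a comment.
All of my information comes from here: http://code.google.com/p/hunpos/
Running the script gives the following output: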
python hunpos.py
[('so', 'RB'), ('how', 'WRB'), ('do', 'VBP'), ('i', 'FW'), ('hunpos', 'NN'), ('tag', 'NN'), ('my', 'PRP$'), ('ntuen', 'NN'), ('?', '.'), ('i', 'FW'), ('ca', 'MD'), ("n't", 'RB'), ('get', 'VB'), ('the', 'DT'), ('following', 'JJ'), ('code', 'NN'), ('to', 'TO'), ('work', 'VB'), ('.', '.')]
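To apply the same idea to a whole corpus rather than a single string, one option is to loop over the sentences that the corpus reader returns and tag each one in turn. The following is only a sketch under a few assumptions: the helper names `tag_corpus` and `tag_plaintext_corpus` are made up for illustration, the `.txt` file pattern is a guess at your file layout, the hunpos-tag binary and the `en_wsj.model` file are installed where nltk can find them, and nltk's sentence tokenizer data is available (the reader's `sents()` needs it).

```python
def tag_corpus(sentences, tagger):
    """Tag every sentence in `sentences` and return the results as a list.

    `sentences` is an iterable of token lists (for example reader.sents()).
    `tagger` is any object with a .tag(tokens) method, such as a HunposTagger.
    """
    return [tagger.tag(list(sentence)) for sentence in sentences]


def tag_plaintext_corpus(corpus_root, model_path, fileids=r'.*\.txt'):
    """Hypothetical helper: read plain-text files and hunpos-tag each sentence."""
    # Imported here so that tag_corpus() above can be used without nltk.
    from nltk.corpus import PlaintextCorpusReader
    from nltk.tag.hunpos import HunposTagger

    reader = PlaintextCorpusReader(corpus_root, fileids)
    ht = HunposTagger(model_path)
    try:
        # reader.sents() yields each sentence as a list of word tokens,
        # which is the input format that tag() expects.
        return tag_corpus(reader.sents(), ht)
    finally:
        ht.close()  # shut down the hunpos-tag subprocess
```

With those assumptions in place, something like `tag_plaintext_corpus('./', 'en_wsj.model')` should return one list of (word, tag) pairs per sentence. The reader already splits sentences into words, so no extra call to `word_tokenize` is needed in this version.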