WordNet lemmatization and POS tagging in Python

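User

Reply #1 · edited 2024-06-01 01:47:34

First, you can use nltk.pos_tag() directly, without training anything. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER: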

>>> import nltk
>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

Because it was trained on the Treebank corpus, it also uses the Treebank tag set.

The following function maps Treebank tags to WordNet part-of-speech names:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then pass the return value to the lemmatizer:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'

Check the return value of get_wordnet_pos() before passing it to the lemmatizer, because an empty string will raise a KeyError.
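
The check described above could be wrapped in a small helper. The sketch below is only an illustration: lemmatize_tagged and its lemmatize parameter are hypothetical names, not part of NLTK, and the single-letter constants are written out by hand (they are the values NLTK assigns to wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV) so that the block does not depend on the WordNet data being installed. When no WordNet tag applies, it calls the lemmatizer without a POS argument, so the lemmatizer falls back to its default (noun).

```python
# Single-letter WordNet POS codes (same values as wordnet.ADJ, etc.).
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'


def get_wordnet_pos(treebank_tag):
    """Map a Treebank tag to a WordNet POS code, or '' if none applies."""
    if treebank_tag.startswith('J'):
        return ADJ
    if treebank_tag.startswith('V'):
        return VERB
    if treebank_tag.startswith('N'):
        return NOUN
    if treebank_tag.startswith('R'):
        return ADV
    return ''


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, treebank_tag) pairs.

    `lemmatize` is any callable that accepts (word) or (word, pos),
    for example WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for token, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        if pos:
            lemmas.append(lemmatize(token, pos))
        else:
            # No WordNet equivalent: let the lemmatizer use its default POS.
            lemmas.append(lemmatize(token))
    return lemmas
```

With NLTK and its data installed, this might be called as lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize).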

User

Reply #2 · edited 2024-06-01 01:47:34

You can build the mapping with Python's defaultdict and take advantage of the fact that the lemmatizer's default tag is noun.

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "Another way of achieving this task"
tokens = word_tokenize(text)
lmtzr = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)
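
One detail of the defaultdict approach is that looking up a missing key also stores that key in the dictionary. If that matters, a plain dict with .get() gives the same noun fallback without the side effect. This is a minimal sketch rather than part of the answer above: TAG_MAP and to_wordnet_pos are illustrative names, the single-letter strings stand in for wn.ADJ, wn.VERB, wn.ADV and wn.NOUN, and the tagged pairs are written by hand instead of coming from pos_tag().

```python
# First letter of a Treebank tag -> WordNet POS code
# ('a', 'v', 'r', 'n' are the values of wn.ADJ, wn.VERB, wn.ADV, wn.NOUN).
TAG_MAP = {'J': 'a', 'V': 'v', 'R': 'r'}


def to_wordnet_pos(treebank_tag, default='n'):
    """Return the WordNet POS for a Treebank tag, defaulting to noun."""
    # Slicing with [:1] also copes with an empty tag string.
    return TAG_MAP.get(treebank_tag[:1], default)


# Hand-written (token, tag) pairs, used here only for illustration.
tagged = [('Another', 'DT'), ('way', 'NN'), ('achieving', 'VBG')]
for token, tag in tagged:
    print(token, tag, '=>', to_wordnet_pos(tag))
```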

User

Reply #3 · edited 2024-06-01 01:47:34

From the source code of nltk.corpus.reader.wordnet (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html):

#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
POS_LIST = [NOUN, VERB, ADJ, ADV]
