Python NLTK: lemmatizing the word 'further' with WordNet

Asked 2025-04-18 02:16

I'm building a lemmatizer with Python, NLTK, and WordNetLemmatizer. Here is a sample that produces exactly the output I expected.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective

Output: 'bad'

lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb

Output: 'worse'

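So far, so good. Other adjectives, such as 'better' (an irregular form) or 'older', behave the same way. (Note that the same test with 'elder' never outputs 'old', but I guess WordNet is not an exhaustive list of all English words.)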

My problem shows up when I try the word 'further':

lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective

Output: 'further'

lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb

Output: 'far'

This is exactly the opposite of how 'worse' behaves!

Can anybody explain why? Is it a bug in the WordNet synset data, or am I misunderstanding English grammar?

Please excuse me if this has already been answered. I searched Google and StackOverflow, but with "further" as a keyword the results are a mess, because the word is so common...

Thank you in advance,
Romain G.

1 Answer

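WordNetLemmatizer uses the ._morphy function to look up a word's possible base forms (lemmas); lemmatize then returns the shortest of those candidates, or the word itself if there are none.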

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word

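The ._morphy function applies rules iteratively to find a lemma: each rule strips or replaces a suffix according to MORPHOLOGICAL_SUBSTITUTIONS, so the candidates keep getting shorter. After each pass it keeps only the candidates that actually exist in the WordNet database for the given part of speech: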

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
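
To make the suffix step concrete, here is a minimal standalone sketch of apply_rules. The ADJ_SUBSTITUTIONS list is my assumption of what the adjective entry of MORPHOLOGICAL_SUBSTITUTIONS looks like, so check it against your NLTK version. It only generates candidates; nothing is looked up in WordNet.

```python
# Adjective suffix rules as (old_suffix, new_suffix) pairs.
# Assumption: this mirrors the ADJ entry of MORPHOLOGICAL_SUBSTITUTIONS.
ADJ_SUBSTITUTIONS = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]


def apply_rules(forms, substitutions):
    """Return every candidate obtained by swapping one matching suffix."""
    return [form[:-len(old)] + new
            for form in forms
            for old, new in substitutions
            if form.endswith(old)]


print(apply_rules(["older"], ADJ_SUBSTITUTIONS))    # ['old', 'olde']
print(apply_rules(["further"], ADJ_SUBSTITUTIONS))  # ['furth', 'furthe']
print(apply_rules(["worse"], ADJ_SUBSTITUTIONS))    # []
```

Neither 'furth' nor 'furthe' is a word, so the rules alone can never turn 'further' into 'far', and no rule matches 'worse' at all.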

However, if the word is in an exception list, the result comes from the fixed entries stored in exceptions instead of from the rules. See _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html

def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
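
As a rough illustration of what that loader builds, the sketch below parses a few lines in the same "inflected-form base-form ..." layout into a dict. parse_exception_lines is a hypothetical helper of mine, not part of NLTK, and the sample lines are a hand-copied excerpt rather than the real file.

```python
def parse_exception_lines(lines):
    """Map the first token of each line to the list of remaining tokens."""
    mapping = {}
    for line in lines:
        terms = line.split()
        if not terms:  # skip blank lines
            continue
        mapping[terms[0]] = terms[1:]
    return mapping


# A few adverb exceptions, copied by hand for illustration.
sample_adv_lines = ["best well", "better well", "farther far", "further far"]
adv_exceptions = parse_exception_lines(sample_adv_lines)

print(adv_exceptions["further"])      # ['far']
print("worse" in adv_exceptions)      # False
```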

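Back to your example: worse -> bad and further -> far cannot be produced by the rules, so they must come from the exception lists. And since these are exception lists, there are bound to be inconsistencies.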
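
Putting the pieces together, here is a toy model of the lookup that reproduces the four results from the question. Everything prefixed TOY_ is made-up data that I chose to resemble the relevant WordNet entries; it is not read from WordNet, and the sketch applies the suffix rules only once instead of looping.

```python
# Toy data, NOT the real WordNet contents. "a" = adjective, "r" = adverb.
TOY_LEXICON = {
    "a": {"bad", "worse", "old", "further"},
    "r": {"far", "further", "worse"},
}
TOY_EXCEPTIONS = {
    "a": {"worse": ["bad"]},     # assumed adj.exc entry
    "r": {"further": ["far"]},   # assumed adv.exc entry
}
TOY_SUBSTITUTIONS = {
    "a": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
    "r": [],                     # assumption: no suffix rules for adverbs
}


def toy_morphy(form, pos):
    def known(forms):
        # Keep forms present in the toy lexicon, without duplicates.
        seen, result = set(), []
        for f in forms:
            if f in TOY_LEXICON[pos] and f not in seen:
                seen.add(f)
                result.append(f)
        return result

    # Exceptions win over the suffix rules.
    if form in TOY_EXCEPTIONS[pos]:
        return known([form] + TOY_EXCEPTIONS[pos][form])

    candidates = [form[:-len(old)] + new
                  for old, new in TOY_SUBSTITUTIONS[pos]
                  if form.endswith(old)]
    return known([form] + candidates)


def toy_lemmatize(word, pos):
    lemmas = toy_morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word


for word, pos in [("worse", "a"), ("worse", "r"),
                  ("further", "a"), ("further", "r")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

In this model the asymmetry comes purely from which exception list happens to contain which word: 'worse' is only listed for adjectives and 'further' only for adverbs.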

The exception lists are stored in ~/nltk_data/corpora/wordnet/adv.exc and ~/nltk_data/corpora/wordnet/adj.exc.

From adv.exc:

best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard

From adj.exc:

...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
