Python NLTK: lemmatizing the word 'further' with WordNet
I am building a lemmatizer with Python, NLTK, and WordNetLemmatizer. Here is some random text whose output is exactly what I expected:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
Output: 'bad'
lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
Output: 'worse'
OK, everything works fine here. Other adjectives such as 'better' (an irregular form) or 'older' behave the same way (note that the same test with 'elder' never outputs 'old', but I suppose WordNet is not an exhaustive list of every English word).
My problem arises when trying the word 'further':
lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective
Output: 'further'
lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb
Output: 'far'
This is exactly the opposite of the behavior of 'worse'! Can anyone tell me why? Is it a problem in the WordNet synset data, or am I misunderstanding English grammar?
My apologies if this question has already been answered. I searched Google and StackOverflow, but when I include the keyword "further" the results are a mess, because the word is so common...
Thanks in advance,
Romain G.
1 Answer
WordNetLemmatizer uses the ._morphy function to get a word's base form (also called the lemma); it then returns the shortest of the possible base forms:
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
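The shortest-candidate selection can be illustrated with plain Python (the lemma lists below are hypothetical illustrations, not actual queries against WordNet):

```python
# Simulate the selection step of lemmatize(): _morphy may return
# several candidate lemmas, and the shortest one wins.
def pick_lemma(word, lemmas):
    # Same expression as in WordNetLemmatizer.lemmatize
    return min(lemmas, key=len) if lemmas else word

# If _morphy returned both the original form and an exception entry,
# the shorter string is chosen:
print(pick_lemma('worse', ['worse', 'bad']))      # -> 'bad'
print(pick_lemma('further', ['further', 'far']))  # -> 'far'
# With no candidates at all, the word is returned unchanged:
print(pick_lemma('blarg', []))                    # -> 'blarg'
```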
The ._morphy function repeatedly applies a set of rules to obtain a lemma; the rules progressively shorten the word by replacing suffixes according to MORPHOLOGICAL_SUBSTITUTIONS, then check whether the shortened form matches another, shorter known word:
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
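To see why the rules alone cannot turn 'worse' into 'bad', here is a stripped-down sketch of the suffix-substitution step. The suffix pairs are modeled on WordNet's adjective substitutions, and the tiny "database" of known lemmas is made up for illustration:

```python
# A minimal sketch of _morphy's rule step (not the real NLTK internals).
# Suffix pairs modeled on WordNet's adjective substitutions.
SUBSTITUTIONS = [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]

# A toy "database" of known adjective lemmas (illustrative only).
DATABASE = {'bad', 'old', 'further'}

def apply_rules(forms):
    # Strip/replace suffixes, exactly as in _morphy above.
    return [form[:-len(old)] + new
            for form in forms
            for old, new in SUBSTITUTIONS
            if form.endswith(old)]

def rule_based_lemmas(form):
    # One round of rules, then keep whatever is in the database
    # (checking the original form too, as step 2 of _morphy does).
    forms = apply_rules([form])
    return [f for f in [form] + forms if f in DATABASE]

print(rule_based_lemmas('older'))    # 'older' - 'er' -> 'old': found
print(rule_based_lemmas('worse'))    # no suffix rule can reach 'bad'
print(rule_based_lemmas('further'))  # 'further' itself is in the database
```

Because no suffix substitution maps 'worse' onto 'bad' (or 'further' onto 'far'), those mappings can only come from the exception lists checked in step 0.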
However, if the word is in one of the exception lists, a fixed value stored in exceptions is returned instead; see _load_exception_map at http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
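The .exc file format is just whitespace-separated lines with the inflected form first. A minimal parser mirroring the loop above, fed a hard-coded sample instead of the real file, could look like this:

```python
# Parse exception-list lines in the WordNet *.exc format:
# each line is "inflected_form lemma [lemma ...]".
sample_adv_exc = """\
best well
better well
further far
"""

exception_map = {}
for line in sample_adv_exc.splitlines():
    terms = line.split()
    if terms:  # skip blank lines
        exception_map[terms[0]] = terms[1:]

print(exception_map['further'])  # -> ['far']
print(exception_map['best'])     # -> ['well']
```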
Coming back to your example: worse -> bad and further -> far cannot be produced by these rules, so they must come from the exception lists. And since it is an exception list, it is bound to contain inconsistencies. The exception lists are stored in ~/nltk_data/corpora/wordnet/adv.exc and ~/nltk_data/corpora/wordnet/adj.exc.
From adv.exc:
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
From adj.exc:
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
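Putting the pieces together, the asymmetry in the question falls out of the exception maps alone: adv.exc contains further -> far but adj.exc does not, so the adjective lookup falls through to the rules and keeps 'further'. The sketch below simulates this with toy data (the maps and database here are illustrative, not the real WordNet files):

```python
# Toy reconstruction of the behavior described in the answer.
EXCEPTIONS = {
    'a': {'worse': ['bad']},    # adjective exceptions (like adj.exc)
    'r': {'further': ['far']},  # adverb exceptions (like adv.exc)
}
DATABASE = {
    'a': {'bad', 'worse', 'further'},  # toy adjective lemma database
    'r': {'far'},                      # toy adverb lemma database
}

def lemmatize(word, pos):
    # 0. consult the exception list first, like _morphy does
    candidates = [word] + EXCEPTIONS[pos].get(word, [])
    lemmas = [c for c in candidates if c in DATABASE[pos]]
    # shortest surviving candidate wins, like WordNetLemmatizer.lemmatize
    return min(lemmas, key=len) if lemmas else word

print(lemmatize('worse', 'a'))    # -> 'bad'     (adj.exc entry)
print(lemmatize('worse', 'r'))    # -> 'worse'   (no adverb exception)
print(lemmatize('further', 'a'))  # -> 'further' (no adjective exception)
print(lemmatize('further', 'r'))  # -> 'far'     (adv.exc entry)
```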