Missing "instance of" relations when iterating WordNet 3.0 synsets with NLTK in Python

Posted 2024-05-15 18:28:06


For certain reasons I need to iterate over all noun synsets in WordNet 3.0 and build them into a tree structure in my program.

But when I do this with the code listed below:

from nltk.corpus import wordnet as wn
stack = []
duplicate_check = []
def iterate_all():
    while(stack):
        current_node = stack.pop()
        print current_node,"on top"
        for hypo in current_node.hyponyms():
            stack.append(hypo)
            duplicate_check.append(hypo)
if __name__ == "__main__":
    root = wn.synset("entity.n.01")
    stack.append(root)
    duplicate_check.append(root)
    iterate_all()
    correct_list = list(wn.all_synsets('n'))
#    print list( set(correct_list) - set(duplicate_check) )
    print len(correct_list)
    print len(duplicate_check)
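Note that the count can also overshoot: despite its name, duplicate_check never checks for duplicates, so a synset with two hypernym paths from the root is pushed once per parent. A minimal visited-set guard on a toy "diamond" graph (names hypothetical) shows the difference:

```python
# Hypothetical toy taxonomy: "event" hangs under two parents, so a
# plain stack walk without a visited check would count it twice.
hyponyms = {
    "entity": ["abstraction", "physical_entity"],
    "abstraction": ["event"],
    "physical_entity": ["event"],  # "event" has two hypernyms here
    "event": [],
}

def iterate_all(root):
    """Depth-first walk that skips already-visited nodes."""
    visited = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:        # the guard the code above lacks
            continue
        visited.append(node)
        stack.extend(hyponyms[node])
    return visited

print(len(iterate_all("entity")))  # 4, not 5: "event" is counted once
```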

I get 96308 entries in duplicate_check, while wn.all_synsets('n') yields 82115. The latter, correct_list, contains the correct number of synsets; duplicate_check does not.

After converting both lists to sets and comparing their elements, I found that the code above loses the "instance of" relations among nouns. Can anyone tell me:

(1) Is the "hyponym" relation in WordNet 3.0 the same as the "instance of" relation?

(2) Is there a mistake in my code that keeps the "instance of" synsets out of duplicate_check?

Thank you for your time.

Environment: Ubuntu 14.04 + Python 2.7 + latest NLTK + WordNet 3.0


1 Answer

Posted 2024-05-15 18:28:06

First, there is no need to iterate top-down from entity.n.01 to collect its hyponyms; you can simply check every synset's root_hypernyms() from the bottom up:

>>> from nltk.corpus import wordnet as wn
>>> len(set(wn.all_synsets('n')))
82115
>>> entity = wn.synset('entity.n.01')
>>> len([i for i in wn.all_synsets('n') if entity in i.root_hypernyms()])
82115

Here is how Synset.root_hypernyms() works; the source is at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L439

(The original source listing was lost here.) The key point is that root_hypernyms() walks upward through both hypernyms() and instance_hypernyms() links, collecting every synset that has no hypernyms of either kind.
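To make that walk concrete, here is a standalone sketch of the same algorithm over a toy graph (all names and links are hypothetical, chosen to mirror WordNet: "manhattan_project" is an *instance* of "project", not a subtype):

```python
# Toy upward links: plain hypernyms vs. instance hypernyms.
hypernyms = {"project": ["entity"], "entity": []}
instance_hypernyms = {"manhattan_project": ["project"]}

def root_hypernyms(synset):
    """Collect topmost ancestors, following BOTH relation types upward."""
    result, seen, todo = [], set(), [synset]
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        # Unlike closure(lambda s: s.hypernyms()), this also follows
        # instance links, so instances still reach entity at the top.
        ups = hypernyms.get(s, []) + instance_hypernyms.get(s, [])
        if ups:
            todo.extend(ups)
        else:
            result.append(s)   # no hypernyms of either kind: a root
    return result

print(root_hypernyms("manhattan_project"))  # ['entity']
```

This is why the bottom-up root_hypernyms() count reaches all 82115 synsets while a hyponyms()-only closure does not.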

There is another way to walk the hyper-/hyponym links, but it does not seem as complete as the one above; see How to get all the hyponyms of a word/synset in python nltk and wordnet?

>>> len(set([s for s in entity.closure(lambda s:s.hyponyms())]))
74373

To iterate over them one by one:

>>> for s in entity.closure(lambda s:s.hyponyms()):
...     print s

So let's try it bottom-up:

>>> from nltk.corpus import wordnet as wn
>>> 
>>> synsets_with_entity_root = 0
>>> entity = wn.synset('entity.n.01')
>>> 
>>> for i in wn.all_synsets('n'):
...     # Get root hypernym the hard way.
...     x = set([s for s in i.closure(lambda s:s.hypernyms())])
...     if entity in x:
...             synsets_with_entity_root +=1
... 

>>> print synsets_with_entity_root
74373

It seems that roughly 8000 synsets (82115 - 74373 = 7742) go missing whether we walk the hypernym/hyponym tree top-down or bottom-up, so let's check which ones:

entity = wn.synset('entity.n.01')
synsets_with_entity_root = 0

for i in wn.all_synsets('n'):
    # Get root hypernym the hard way.
    x = set([s for s in i.closure(lambda s: s.hypernyms())])
    if entity in x:
        synsets_with_entity_root += 1
    else:
        print i, i.root_hypernyms()

You will get the list of the ~8000 missing synsets; here are the first few you will see:

Synset('entity.n.01') [Synset('entity.n.01')]
Synset('hegira.n.01') [Synset('entity.n.01')]
Synset('underground_railroad.n.01') [Synset('entity.n.01')]
Synset('babylonian_captivity.n.01') [Synset('entity.n.01')]
Synset('creation.n.05') [Synset('entity.n.01')]
Synset('berlin_airlift.n.01') [Synset('entity.n.01')]
Synset('secession.n.02') [Synset('entity.n.01')]
Synset('human_genome_project.n.01') [Synset('entity.n.01')]
Synset('manhattan_project.n.02') [Synset('entity.n.01')]
Synset('peasant's_revolt.n.01') [Synset('entity.n.01')]
Synset('first_crusade.n.01') [Synset('entity.n.01')]
Synset('second_crusade.n.01') [Synset('entity.n.01')]
Synset('third_crusade.n.01') [Synset('entity.n.01')]
Synset('fourth_crusade.n.01') [Synset('entity.n.01')]
Synset('fifth_crusade.n.01') [Synset('entity.n.01')]
Synset('sixth_crusade.n.01') [Synset('entity.n.01')]
Synset('seventh_crusade.n.01') [Synset('entity.n.01')]

So the closure() method is a little lossy here, but it is still an elegant approach if you do not care about the exact numbers.
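The loss comes from the relation you pass in: closure(lambda s: s.hyponyms()) follows only plain hyponym links, so synsets attached via instance_hyponyms() (e.g. the crusades listed above, which are *instances* of events rather than subtypes) are never reached. Passing the union, closure(lambda s: s.hyponyms() + s.instance_hyponyms()), should recover them. Here is the idea on a toy graph (names hypothetical, modeled on WordNet):

```python
# Toy downward links: "war" is a plain hyponym of "event";
# "first_crusade" is an *instance* hyponym of "war".
hyponyms = {"event": ["war"], "war": [], "first_crusade": []}
instance_hyponyms = {"event": [], "war": ["first_crusade"], "first_crusade": []}

def closure(synset, rel):
    """Transitive closure of rel, sketching NLTK's Synset.closure()."""
    seen, todo = set(), list(rel(synset))
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        todo.extend(rel(s))
    return seen

only_plain = closure("event", lambda s: hyponyms[s])
both_kinds = closure("event", lambda s: hyponyms[s] + instance_hyponyms[s])
print(sorted(only_plain))  # ['war'] - the instance is missed
print(sorted(both_kinds))  # ['first_crusade', 'war']
```

With the real NLTK objects, the same union of relations applied to entity.n.01 should close the ~8000-synset gap against wn.all_synsets('n').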
