Missing "instance of" relations when iterating WordNet 3.0 synsets with NLTK in Python

Posted 2024-05-15 18:28:06


For certain reasons I need to iterate over all noun synsets in WordNet 3.0 and build them into a tree structure in my program.

But when I do this with the code listed below:

from nltk.corpus import wordnet as wn
stack = []
duplicate_check = []
def iterate_all():
    while(stack):
        current_node = stack.pop()
        print current_node,"on top"
        for hypo in current_node.hyponyms():
            stack.append(hypo)
            duplicate_check.append(hypo)
if __name__ == "__main__":
    root = wn.synset("entity.n.01")
    stack.append(root)
    duplicate_check.append(root)
    iterate_all()
    correct_list = list(wn.all_synsets('n'))
#    print list( set(correct_list) - set(duplicate_check) )
    print len(correct_list)
    print len(duplicate_check)
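Note that the count can also overshoot: despite its name, duplicate_check never checks for duplicates, so a synset with two hypernym paths from the root is pushed once per parent. A minimal visited-set guard on a toy "diamond" graph (names hypothetical) shows the difference:

```python
# Hypothetical toy taxonomy: "event" hangs under two parents, so a
# plain stack walk without a visited check would count it twice.
hyponyms = {
    "entity": ["abstraction", "physical_entity"],
    "abstraction": ["event"],
    "physical_entity": ["event"],  # "event" has two hypernyms here
    "event": [],
}

def iterate_all(root):
    """Depth-first walk that skips already-visited nodes."""
    visited = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:        # the guard the code above lacks
            continue
        visited.append(node)
        stack.extend(hyponyms[node])
    return visited

print(len(iterate_all("entity")))  # 4, not 5: "event" is counted once
```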

I get 96308 entries in duplicate_check, while wn.all_synsets('n') yields 82115. The latter, correct_list, contains the correct number of synsets; duplicate_check does not.

After converting both lists to sets and comparing their elements, I found that the code above loses the "instance of" relations among nouns. Can anyone tell me:

(1) Is the "hyponym" relation in WordNet 3.0 the same as the "instance of" relation?

(2) Is there a mistake in my code that keeps the "instance of" synsets out of duplicate_check?

Thank you for your time.

Environment: Ubuntu 14.04 + Python 2.7 + latest NLTK + WordNet 3.0


1 Answer

Posted 2024-05-15 18:28:06

First, there is no need to iterate top-down from entity.n.01 to collect its hyponyms; you can simply check every synset's root_hypernyms() from the bottom up:

>>> from nltk.corpus import wordnet as wn
>>> len(set(wn.all_synsets('n')))
82115
>>> entity = wn.synset('entity.n.01')
>>> len([i for i in wn.all_synsets('n') if entity in i.root_hypernyms()])
82115

Here is how Synset.root_hypernyms() works; the source is at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L439

(The original source listing was lost here.) The key point is that root_hypernyms() walks upward through both hypernyms() and instance_hypernyms() links, collecting every synset that has no hypernyms of either kind.
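To make that walk concrete, here is a standalone sketch of the same algorithm over a toy graph (all names and links are hypothetical, chosen to mirror WordNet: "manhattan_project" is an *instance* of "project", not a subtype):

```python
# Toy upward links: plain hypernyms vs. instance hypernyms.
hypernyms = {"project": ["entity"], "entity": []}
instance_hypernyms = {"manhattan_project": ["project"]}

def root_hypernyms(synset):
    """Collect topmost ancestors, following BOTH relation types upward."""
    result, seen, todo = [], set(), [synset]
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        # Unlike closure(lambda s: s.hypernyms()), this also follows
        # instance links, so instances still reach entity at the top.
        ups = hypernyms.get(s, []) + instance_hypernyms.get(s, [])
        if ups:
            todo.extend(ups)
        else:
            result.append(s)   # no hypernyms of either kind: a root
    return result

print(root_hypernyms("manhattan_project"))  # ['entity']
```

This is why the bottom-up root_hypernyms() count reaches all 82115 synsets while a hyponyms()-only closure does not.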

There is another way to walk the hyper-/hyponym links, but it does not seem as complete as the one above; see How to get all the hyponyms of a word/synset in python nltk and wordnet?

>>> len(set([s for s in entity.closure(lambda s:s.hyponyms())]))
74373

To iterate over them one by one:

>>> for s in entity.closure(lambda s:s.hyponyms()):
...     print s

So let's try it bottom-up:

>>> from nltk.corpus import wordnet as wn
>>> 
>>> synsets_with_entity_root = 0
>>> entity = wn.synset('entity.n.01')
>>> 
>>> for i in wn.all_synsets('n'):
...     # Get root hypernym the hard way.
...     x = set([s for s in i.closure(lambda s:s.hypernyms())])
...     if entity in x:
...             synsets_with_entity_root +=1
... 

>>> print synsets_with_entity_root
74373

It seems that roughly 8000 synsets (82115 - 74373 = 7742) go missing whether we walk the hypernym/hyponym tree top-down or bottom-up, so let's check which ones:

entity = wn.synset('entity.n.01')
synsets_with_entity_root = 0

for i in wn.all_synsets('n'):
    # Get root hypernym the hard way.
    x = set([s for s in i.closure(lambda s: s.hypernyms())])
    if entity in x:
        synsets_with_entity_root += 1
    else:
        print i, i.root_hypernyms()

You will get the list of the ~8000 missing synsets; here are the first few you will see:

Synset('entity.n.01') [Synset('entity.n.01')]
Synset('hegira.n.01') [Synset('entity.n.01')]
Synset('underground_railroad.n.01') [Synset('entity.n.01')]
Synset('babylonian_captivity.n.01') [Synset('entity.n.01')]
Synset('creation.n.05') [Synset('entity.n.01')]
Synset('berlin_airlift.n.01') [Synset('entity.n.01')]
Synset('secession.n.02') [Synset('entity.n.01')]
Synset('human_genome_project.n.01') [Synset('entity.n.01')]
Synset('manhattan_project.n.02') [Synset('entity.n.01')]
Synset('peasant's_revolt.n.01') [Synset('entity.n.01')]
Synset('first_crusade.n.01') [Synset('entity.n.01')]
Synset('second_crusade.n.01') [Synset('entity.n.01')]
Synset('third_crusade.n.01') [Synset('entity.n.01')]
Synset('fourth_crusade.n.01') [Synset('entity.n.01')]
Synset('fifth_crusade.n.01') [Synset('entity.n.01')]
Synset('sixth_crusade.n.01') [Synset('entity.n.01')]
Synset('seventh_crusade.n.01') [Synset('entity.n.01')]

So the closure() method is a little lossy here, but it is still an elegant approach if you do not care about the exact numbers.
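The loss comes from the relation you pass in: closure(lambda s: s.hyponyms()) follows only plain hyponym links, so synsets attached via instance_hyponyms() (e.g. the crusades listed above, which are *instances* of events rather than subtypes) are never reached. Passing the union, closure(lambda s: s.hyponyms() + s.instance_hyponyms()), should recover them. Here is the idea on a toy graph (names hypothetical, modeled on WordNet):

```python
# Toy downward links: "war" is a plain hyponym of "event";
# "first_crusade" is an *instance* hyponym of "war".
hyponyms = {"event": ["war"], "war": [], "first_crusade": []}
instance_hyponyms = {"event": [], "war": ["first_crusade"], "first_crusade": []}

def closure(synset, rel):
    """Transitive closure of rel, sketching NLTK's Synset.closure()."""
    seen, todo = set(), list(rel(synset))
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        todo.extend(rel(s))
    return seen

only_plain = closure("event", lambda s: hyponyms[s])
both_kinds = closure("event", lambda s: hyponyms[s] + instance_hyponyms[s])
print(sorted(only_plain))  # ['war'] - the instance is missed
print(sorted(both_kinds))  # ['first_crusade', 'war']
```

With the real NLTK objects, the same union of relations applied to entity.n.01 should close the ~8000-synset gap against wn.all_synsets('n').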
