Extracting relationships using NLTK
This is a follow-up to my earlier question. I am using nltk to parse out persons, organizations, and the relationships between them from text. Following this example, I was able to chunk out the persons and organizations; however, I get an error when I call nltk.sem.extract_rels:
AttributeError: 'Tree' object has no attribute 'text'
Here is the full code:
import nltk
import re

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)

# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]

# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')

for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)
This is very similar to the example given in the book, but the book's example uses prepared "parsed docs", and I don't know where those come from or what object type they are. I went through the git library as well. Any help is appreciated.
My ultimate goal is to extract persons, organizations, and titles (dates) for a set of companies, and then to create network maps of the persons and organizations.
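For that last step, here is a minimal sketch of how extracted person-organization pairs could be turned into a graph with networkx; the pairs list is placeholder data standing in for whatever the relation-extraction step eventually returns:

import networkx as nx

# placeholder pairs; in practice these would come from the extracted relations
pairs = [('Bill Gates', 'Microsoft'), ('Steve Ballmer', 'Microsoft')]

G = nx.Graph()
for person, org in pairs:
    G.add_node(person, kind='person')
    G.add_node(org, kind='org')
    G.add_edge(person, org)

print(G.edges())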
4 Answers
0
This is an NLTK version issue. Your code should run fine on NLTK 2.x, but on NLTK 3 you need to write it like this:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
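If you are unsure which API applies, a quick check of the installed version (not part of the original answer) settles it:

import nltk
print(nltk.__version__)  # 2.x keeps nltk.sem.show_raw_rtuple; 3.x renames it to nltk.sem.relextract.rtuple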
5
Here is the source code of the nltk.sem.extract_rels function:
def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
    ....
So when you pass corpus='ieer', the nltk.sem.extract_rels function expects the doc argument to be an IEERDocument object. You should pass corpus='ace', or simply leave it out (it is the default); in that case it expects a list of chunk trees, which is exactly what you want. I modified the code like this:
import nltk
import re
from nltk.sem import extract_rels, rtuple

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# here i changed reg ex and below i exchanged subj and obj classes' places
OF = re.compile(r'.*\bof\b.*')

for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent)  # ne_chunk method expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7)  # extract_rels method expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))
And it gives this result:
[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
6
To be a "parsed doc", an object just needs two members: a headline and a text, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, the following (hacky) example works:
import nltk
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

class doc():
    pass

doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print nltk.sem.relextract.show_raw_rtuple(rel)
When run, this produces the output:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
Obviously you wouldn't really code it this way, but it is a working example of the data format expected by extract_rels. You just need to figure out how to massage your data into that format; a sketch of doing that for the pipeline in the question follows.
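Here is a rough sketch of that (my own adaptation, untested against the original data): each chunked sentence from the question's pipeline is wrapped in a small object exposing the headline and text attributes that the corpus='ieer' path walks, reusing the question's file and pattern and the NLTK 3 names extract_rels/rtuple from the answer above:

import nltk
import re
from nltk.sem import extract_rels, rtuple  # NLTK 3 names; on NLTK 2 use nltk.sem.show_raw_rtuple

# pattern to find <person> served as <title> in <org>, as in the question
IN = re.compile(r'.+\s+as\s+')

class Doc(object):
    """Minimal stand-in for an IEER parsed document: just headline and text."""
    def __init__(self, headline, text):
        self.headline = headline
        self.text = text

with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

for tagged in (nltk.pos_tag(nltk.word_tokenize(s)) for s in nltk.sent_tokenize(sample)):
    chunked = nltk.ne_chunk(tagged)
    # the children of an ne_chunk tree are (word, tag) pairs and entity subtrees,
    # which is the shape extract_rels iterates over for doc.text
    doc = Doc(headline=[], text=list(chunked))
    for rel in extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=IN):
        print(rtuple(rel))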