Python wrapper for Stanford CoreNLP
Pynlp
A Pythonic wrapper for Stanford CoreNLP.
Description
This library provides a Python interface built on top of …
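Installation
- Download Stanford CoreNLP from the official download page.
- Unzip the file and set the CORE_NLP environment variable to point to the resulting directory.
- Install pynlp from pip: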
pip3 install pynlp
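A common setup mistake is a CORE_NLP variable that is unset or points to the wrong place. The following is a minimal sketch for checking it with the Python standard library only. The helper name find_corenlp_home is hypothetical and not part of pynlp, and the check only confirms that the directory exists, not that it contains a valid CoreNLP distribution.

```python
import os
from pathlib import Path


def find_corenlp_home(env=None):
    """Return the directory named by CORE_NLP, or raise a clear error.

    Hypothetical helper for illustration; it is not part of pynlp.
    """
    env = os.environ if env is None else env
    value = env.get("CORE_NLP")
    if not value:
        raise RuntimeError("CORE_NLP is not set")
    home = Path(value).expanduser()
    if not home.is_dir():
        raise RuntimeError("CORE_NLP points to a missing directory: {}".format(home))
    return home


if __name__ == "__main__":
    try:
        print("CoreNLP directory:", find_corenlp_home())
    except RuntimeError as error:
        print("Problem:", error)
```

Passing a plain dictionary as env makes the check easy to try without touching the real environment.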
Quick start
Launch the server
Launch the StanfordCoreNLPServer using the instructions given here. Alternatively, simply run the module:
python3 -m pynlp
By default, this launches the server on localhost using port 9000 and 4 GB of RAM for the JVM. Use the --help option for instructions on custom configurations.
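Before annotating, it can help to confirm that something is listening on the server port. The following is a minimal sketch using only the Python standard library. The function name server_is_listening is hypothetical and not part of pynlp, and its defaults assume the localhost and port 9000 configuration described above. It only checks that a TCP connection can be opened, not that the listener is really CoreNLP.

```python
import socket


def server_is_listening(host="localhost", port=9000, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it is not part of pynlp.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("A server is accepting connections on localhost:9000")
    else:
        print("Nothing is listening on localhost:9000")
```

If the server was started with a custom port, pass that port explicitly.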
Example
Let's start with an excerpt from a CNN article.
text = ('GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, '
        'according to Kentucky State Police. State troopers responded to a call to the senator\'s '
        'residence at 3:21 p.m. Friday. Police arrested a man named Rene Albert Boucher, who they '
        'allege "intentionally assaulted" Paul, causing him "minor injury". Boucher, 59, of Bowling '
        'Green was charged with one count of fourth-degree assault. As of Saturday afternoon, he '
        'was being held in the Warren County Regional Jail on a $5,000 bond.')
Instantiate the annotator
Here we demonstrate the following annotators:
- Annotators: tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment, quote, openie
- Options: openie.resolve_coref
from pynlp import StanfordCoreNLP

annotators = 'tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment, quote, openie'
options = {'openie.resolve_coref': True}

nlp = StanfordCoreNLP(annotators=annotators, options=options)
Annotate text
The nlp instance is callable. Use it to annotate the text and return a Document object.
document = nlp(text)
print(document)  # prints 'text'
Sentence splitting
Let's test the ssplit annotator. A Document object iterates over its Sentence objects.
for index, sentence in enumerate(document):
    print(index, sentence, sep=') ')
Output:
0) GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, according to Kentucky State Police.
1) State troopers responded to a call to the senator's residence at 3:21 p.m. Friday.
2) Police arrested a man named Rene Albert Boucher, who they allege "intentionally assaulted" Paul, causing him "minor injury".
3) Boucher, 59, of Bowling Green was charged with one count of fourth-degree assault.
4) As of Saturday afternoon, he was being held in the Warren County Regional Jail on a $5,000 bond.
Named entity recognition
How about finding all the people mentioned in the document?
[str(entity) for entity in document.entities if entity.type == 'PERSON']
Output:
Out[2]: ['Rand Paul', 'Rene Albert Boucher', 'Paul', 'Boucher']
We can also use named entities at the sentence level.
first_sentence = document[0]
for entity in first_sentence.entities:
    print(entity, '({})'.format(entity.type))
Output:
GOP (ORGANIZATION)
Rand Paul (PERSON)
Bowling Green (LOCATION)
Kentucky (LOCATION)
Friday (DATE)
Kentucky State Police (ORGANIZATION)
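Once entities are available, a natural next step is to group them by type. The following is a minimal sketch that relies only on the two things shown above: str(entity) and entity.type. The function group_entities_by_type and the FakeEntity class are hypothetical stand-ins so that the example runs without a CoreNLP server; with pynlp, you would pass first_sentence.entities or document.entities instead.

```python
from collections import defaultdict


def group_entities_by_type(entities):
    """Map each entity type to its entity strings, in order of appearance."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.type].append(str(entity))
    return dict(grouped)


class FakeEntity:
    """Hypothetical stand-in for a pynlp entity, used only for this demo."""

    def __init__(self, text, type):
        self.text = text
        self.type = type

    def __str__(self):
        return self.text


# The entities listed in the output above, rebuilt as stand-in objects.
sample = [
    FakeEntity('GOP', 'ORGANIZATION'),
    FakeEntity('Rand Paul', 'PERSON'),
    FakeEntity('Bowling Green', 'LOCATION'),
    FakeEntity('Kentucky', 'LOCATION'),
    FakeEntity('Friday', 'DATE'),
    FakeEntity('Kentucky State Police', 'ORGANIZATION'),
]

grouped = group_entities_by_type(sample)
for entity_type, names in grouped.items():
    print(entity_type, names, sep=': ')
```

A plain dictionary of lists keeps the grouping easy to inspect or serialize.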
Part-of-speech tagging
Let's find all the 'VB' tags in the first sentence. A Sentence object iterates over Token objects.
for token in first_sentence:
    if 'VB' in token.pos:
        print(token, token.pos)
Output:
was VBD
assaulted VBN
according VBG
Lemmatization
Using the same words, let's look at the lemmas.
for token in first_sentence:
    if 'VB' in token.pos:
        print(token, '->', token.lemma)
Output:
was -> be
assaulted -> assault
according -> accord
Coreference resolution
Let's use pynlp to find the first CorefChain in the text.
chain = document.coref_chains[0]
print(chain)
Output:
((GOP Sen. Rand Paul))-[id=4] was assaulted in (his)-[id=5] home in Bowling Green, Kentucky, on Friday, according to Kentucky State Police.
State troopers responded to a call to (the senator's)-[id=10] residence at 3:21 p.m. Friday.
Police arrested a man named Rene Albert Boucher, who they allege "(intentionally assaulted" Paul)-[id=16], causing him "minor injury.
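In the string representation, coreferences are marked with parentheses and the referent with double parentheses.
Each is also labelled with a coref_id. Let's take a closer look at the referent.
ref = chain.referent
print('Coreference: {}\n'.format(ref))

for attr in 'type', 'number', 'animacy', 'gender':
    print(attr, getattr(ref, attr), sep=': ')

# Note that we can also index coreferences by id
assert chain[4].is_referent
Output: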
Coreference: Police
type: PROPER
number: SINGULAR
animacy: ANIMATE
gender: UNKNOWN
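The bracket notation in the printed chain can also be read mechanically when only the text output is at hand, for example in a saved log. The following is a rough sketch that assumes the format is exactly as printed above: a mention in single or double parentheses followed by -[id=N]. The name parse_mentions is hypothetical and not part of pynlp, and mentions containing nested parentheses are not handled. When the CorefChain object itself is available, using it directly is the better option.

```python
import re

# One or two opening parentheses, the mention text, the closing
# parentheses, then the id tag, e.g. "(his)-[id=5]".
MENTION_PATTERN = re.compile(r"(\({1,2})([^()]+)\){1,2}-\[id=(\d+)\]")


def parse_mentions(chain_text):
    """Return (text, coref_id, is_referent) tuples found in a printed chain.

    Hypothetical helper for illustration; it is not part of pynlp.
    """
    mentions = []
    for opening, text, coref_id in MENTION_PATTERN.findall(chain_text):
        # Double parentheses mark the referent in the printed form.
        mentions.append((text, int(coref_id), opening == "(("))
    return mentions


line = ("((GOP Sen. Rand Paul))-[id=4] was assaulted in (his)-[id=5] home "
        "in Bowling Green, Kentucky, on Friday")
for mention in parse_mentions(line):
    print(mention)
```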
Quotes
Extracting quotes from the text is simple.
print(document.quotes)
Output:
[<Quote: "intentionally assaulted">, <Quote: "minor injury">]
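To do (annotation wrappers):
- [X] ssplit
- [ ] ner
- [X] pos
- [X] lemma
- [X] coref
- [X] quote
- [ ] quote.attribution
- [ ] parse
- [ ] depparse
- [X] entitymentions
- [ ] openie
- [ ] sentiment
- [ ] relation
- [ ] kbp
- [ ] entitylink
- [ ] options examples, i.e. openie.resolve_coref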
Saving annotations
Write
A pynlp document can be saved as a byte string.
with open('annotation.dat', 'wb') as file:
    file.write(document.to_bytes())
Read
To load a pynlp document, instantiate a Document using the from_bytes class method.
from pynlp import Document

with open('annotation.dat', 'rb') as file:
    document = Document.from_bytes(file.read())
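The write and read steps can be wrapped in a pair of small helpers. The following is a minimal sketch that relies only on the to_bytes method and from_bytes class method described above. The names save_annotation and load_annotation are hypothetical, and FakeDocument is a stand-in so the round trip can run without a CoreNLP server; with pynlp, you would pass a real Document and the Document class.

```python
import tempfile
from pathlib import Path


def save_annotation(document, path):
    """Write the document's byte-string form to path."""
    Path(path).write_bytes(document.to_bytes())


def load_annotation(document_class, path):
    """Rebuild a document from the bytes stored at path."""
    return document_class.from_bytes(Path(path).read_bytes())


class FakeDocument:
    """Hypothetical stand-in exposing the same to_bytes/from_bytes pair."""

    def __init__(self, text):
        self.text = text

    def to_bytes(self):
        return self.text.encode('utf-8')

    @classmethod
    def from_bytes(cls, data):
        return cls(data.decode('utf-8'))


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as directory:
    target = Path(directory) / 'annotation.dat'
    save_annotation(FakeDocument('State troopers responded.'), target)
    restored = load_annotation(FakeDocument, target)

print(restored.text)
```

Passing the class to load_annotation keeps the helper independent of any particular import.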