<p>This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence. For the networkx part, it borrows quite a bit of code from HugoMailhot's <a href="https://stackoverflow.com/a/32895132/395857">answer</a>.</p>
<p>Before running the code, you need to:</p>
<ol>
<li><code>sudo pip install pycorenlp</code> (a Python wrapper for Stanford CoreNLP)</li>
<li>Download <a href="http://stanfordnlp.github.io/CoreNLP" rel="nofollow noreferrer">Stanford CoreNLP</a></li>
<li><p>Start the Stanford CoreNLP server as follows:</p>
<pre><code>java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
</code></pre></li>
</ol>
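<p>Before launching the full pipeline, it can be handy to confirm that the server is actually listening. The helper below is not part of the original answer; the function name and the two-second timeout are my own choices, and it only checks that <em>something</em> answers HTTP on the given URL:</p>

<pre><code>import urllib.request
import urllib.error

def corenlp_server_is_up(url='http://localhost:9000', timeout=2):
    """Return True if an HTTP server answers at `url` (hypothetical helper)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server responded, just with an error status: it is up.
        return True
    except (urllib.error.URLError, OSError):
        return False

print('CoreNLP server reachable: {0}'.format(corenlp_server_is_up()))
</code></pre>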
<p>You can then run the following code to find the shortest dependency path between two words:</p>
<pre><code>import networkx as nx
from pycorenlp import StanfordCoreNLP
from pprint import pprint

nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))

def get_stanford_annotations(text, port=9000,
                             annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
    output = nlp.annotate(text, properties={
        "timeout": "10000",
        "ssplit.newlineIsSentenceBreak": "two",
        'annotators': annotators,
        'outputFormat': 'json'
    })
    return output

# The code expects the document to contain exactly one sentence.
document = 'Robots in popular culture are there to remind us of the awesomeness of '\
           'unbound human agency.'
print('document: {0}'.format(document))

# Parse the text
annotations = get_stanford_annotations(document, port=9000,
                                       annotators='tokenize,ssplit,pos,lemma,depparse')
tokens = annotations['sentences'][0]['tokens']

# Load Stanford CoreNLP's dependency tree into a networkx graph
edges = []
dependencies = {}
for edge in annotations['sentences'][0]['basic-dependencies']:
    edges.append((edge['governor'], edge['dependent']))
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge
graph = nx.Graph(edges)
#pprint(dependencies)
#print('edges: {0}'.format(edges))

# Find the shortest path
token1 = 'Robots'
token2 = 'awesomeness'
for token in tokens:
    if token1 == token['originalText']:
        token1_index = token['index']
    if token2 == token['originalText']:
        token2_index = token['index']

path = nx.shortest_path(graph, source=token1_index, target=token2_index)
print('path: {0}'.format(path))
for token_id in path:
    token = tokens[token_id - 1]
    token_text = token['originalText']
    print('Node {0}\ttoken_text: {1}'.format(token_id, token_text))
</code></pre>
<p>The output is:</p>
<pre><code>document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
path: [1, 5, 8, 12]
Node 1 token_text: Robots
Node 5 token_text: are
Node 8 token_text: remind
Node 12 token_text: awesomeness
</code></pre>
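<p>The <code>dependencies</code> dictionary built in the code above (keyed on the sorted <code>(governor, dependent)</code> pair so a lookup works in either direction) can also recover the relation label on each hop of the path. Here is a sketch using hand-made edges in CoreNLP's JSON shape; the indices mirror the example sentence, but the <code>dep</code> labels are illustrative assumptions, not real parser output:</p>

<pre><code>import networkx as nx

# Hand-made edges in CoreNLP's JSON shape; the 'dep' labels here are
# illustrative assumptions, not real parser output.
mock_edges = [
    {'governor': 5, 'dependent': 1, 'dep': 'nsubj'},
    {'governor': 5, 'dependent': 8, 'dep': 'advcl'},
    {'governor': 8, 'dependent': 12, 'dep': 'nmod'},
]

edges = []
dependencies = {}
for edge in mock_edges:
    edges.append((edge['governor'], edge['dependent']))
    # Key on the sorted pair so the lookup works regardless of edge direction.
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge

graph = nx.Graph(edges)
path = nx.shortest_path(graph, source=1, target=12)

# Walk consecutive node pairs and print the relation on each hop.
for a, b in zip(path, path[1:]):
    hop = dependencies[(min(a, b), max(a, b))]
    print('{0} -[{1}]-> {2}'.format(hop['governor'], hop['dep'], hop['dependent']))
</code></pre>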
<p>Note that Stanford CoreNLP can also be tried online: <a href="http://nlp.stanford.edu:8080/parser/index.jsp" rel="nofollow noreferrer">http://nlp.stanford.edu:8080/parser/index.jsp</a></p>
<p>This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0, and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.</p>
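<p>One design note: the code builds an undirected <code>nx.Graph</code> rather than a <code>DiGraph</code> on purpose, because the shortest dependency path between two tokens typically climbs from one token up to a common ancestor and back down, which no directed path allows. A minimal sketch with the same token indices as the example output (the head assignments here are assumed for illustration):</p>

<pre><code>import networkx as nx

# (governor, dependent) pairs: assume 5 heads 1 and 8, and 8 heads 12.
edges = [(5, 1), (5, 8), (8, 12)]

directed = nx.DiGraph(edges)
undirected = nx.Graph(edges)

# Token 1 has no outgoing edges, so no directed path reaches token 12,
# but the undirected graph finds the path through the common ancestor.
print(nx.has_path(directed, 1, 12))          # False
print(nx.shortest_path(undirected, 1, 12))   # [1, 5, 8, 12]
</code></pre>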