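Converting a nested-parenthesis tree to a nested list
I have a file containing a tree structure in which parentheses represent the nesting. Below is the code I use to convert it to a nested Python list.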
def foo(s):
    def foo_helper(level=0):
        try:
            token = next(tokens)
        except StopIteration:
            if level != 0:
                raise Exception('missing closing paren')
            else:
                return []
        if token == ')':
            if level == 0:
                raise Exception('missing opening paren')
            else:
                return []
        elif token == '(':
            return [foo_helper(level+1)] + foo_helper(level)
        else:
            return [token] + foo_helper(level)
    tokens = iter(s)
    return foo_helper()
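This approach is based on "How to parse a string and return a nested array".
It works well when every token is a single character, but not when the tokens are words or sentences. A sample of my tree: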
( Satellite (span 69 74) (rel2par Elaboration)
  ( Nucleus (span 69 72) (rel2par span)
    ( Nucleus (span 69 70) (rel2par span)
      ( Nucleus (leaf 69) (rel2par span) (text _!MERRILL LYNCH READY ASSETS TRUST :_!) )
      ( Satellite (leaf 70) (rel2par Elaboration) (text _!8.65 % ._!) )
    )
    ( Satellite (span 71 72) (rel2par Elaboration)
      ( Nucleus (leaf 71) (rel2par span) (text _!Annualized average rate of return_!) )
      ( Satellite (leaf 72) (rel2par Temporal) (text _!after expenses for the past 30 days ;_!) )
    )
  )
  ( Satellite (span 73 74) (rel2par Elaboration)
    ( Nucleus (leaf 73) (rel2par span) (text _!not a forecast_!) )
    ( Satellite (leaf 74) (rel2par Elaboration) (text _!of future returns ._!) )
  )
)
For this input I expect the output to be
['satellite',['span','69','74'].........]
but with the function above I get ['s','a','t'...............['s','p','a','n','7','3']..............] instead.
How should I modify it?
2 Answers
1
Rather than calling the function directly on the string, call it on a list of words, i.e. break the string up with split first:
def parse(s):
    def parse_helper(level=0):
        try:
            token = next(tokens)
        except StopIteration:
            if level:
                raise Exception('Missing close paren')
            else:
                return []
        if token == ')':
            if not level:
                raise Exception('Missing open paren')
            else:
                return []
        elif token == '(':
            return [parse_helper(level+1)] + parse_helper(level)
        else:
            return [token] + parse_helper(level)
    tokens = iter(s)
    return parse_helper()

if __name__ == '__main__':
    with open('tree.thing', 'r') as treefile:
        tree = treefile.read()
    print(parse(tree.split()))
Here, treefile contains the data structure you posted, and I got this output:
[['Satellite', '(span', '69', '74)', '(rel2par', 'Elaboration)', ['Nucleus', '(span', '69', '72)', '(rel2par', 'span)', ['Nucleus', '(span', '69', '70)', '(rel2par', 'span)', ['Nucleus', '(leaf', '69)', '(rel2par', 'span)', '(text', '_!MERRILL', 'LYNCH', 'READY', 'ASSETS', 'TRUST', ':_!)'], ['Satellite', '(leaf', '70)', '(rel2par', 'Elaboration)', '(text', '_!8.65', '%', '._!)']], ['Satellite', '(span', '71', '72)', '(rel2par', 'Elaboration)', ['Nucleus', '(leaf', '71)', '(rel2par', 'span)', '(text', '_!Annualized', 'average', 'rate', 'of', 'return_!)'], ['Satellite', '(leaf', '72)', '(rel2par', 'Temporal)', '(text', '_!after', 'expenses', 'for', 'the', 'past', '30', 'days', ';_!)']]], ['Satellite', '(span', '73', '74)', '(rel2par', 'Elaboration)', ['Nucleus', '(leaf', '73)', '(rel2par', 'span)', '(text', '_!not', 'a', 'forecast_!)'], ['Satellite', '(leaf', '74)', '(rel2par', 'Elaboration)', '(text', '_!of', 'future', 'returns', '._!)']]]]
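One thing the output above shows is that str.split() only cuts on whitespace, so a parenthesis glued to a word stays attached ('(span', '74)') and the inner lists are never opened. Below is a minimal sketch of one way around that, assuming it is acceptable to treat every parenthesis as its own token. The names TOKEN_RE and tokenize are hypothetical, and the parser is the same recursive scheme as above with ValueError swapped in for the bare Exception.

```python
import re

# Hypothetical helper: each parenthesis becomes its own token, and every
# other run of characters that are neither whitespace nor parentheses
# becomes a word token.
TOKEN_RE = re.compile(r'[()]|[^\s()]+')

def tokenize(text):
    return TOKEN_RE.findall(text)

def parse(tokens):
    """Same recursive scheme as above, for any iterable of tokens."""
    tokens = iter(tokens)

    def helper(level=0):
        try:
            token = next(tokens)
        except StopIteration:
            if level:
                raise ValueError('Missing close paren')
            return []
        if token == ')':
            if not level:
                raise ValueError('Missing open paren')
            return []
        if token == '(':
            # Parse the nested list first, then the rest of this level.
            return [helper(level + 1)] + helper(level)
        return [token] + helper(level)

    return helper()

sample = '( Satellite (span 69 74) (rel2par Elaboration) )'
print(parse(tokenize(sample)))
# [['Satellite', ['span', '69', '74'], ['rel2par', 'Elaboration']]]
```

With this tokenizer the text leaves are still broken into individual words ('_!not', 'a', 'forecast_!'), so it only addresses the attached parentheses.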
1
I assumed you want _! to delimit strings that contain spaces, so I split the expression with a regular expression:
from re import compile
resexp = compile(r'([()]|_!)')
…
tokens = iter(resexp.split(s))
…
The result I got (using pprint with depth set to 4):
$ python lispparse.py | head
['\n',
[' Satellite ',
['span 69 74'],
' ',
['rel2par Elaboration'],
'\n ',
[' Nucleus ',
['span 69 72'],
' ',
['rel2par span'],
I then refined it a little by stripping whitespace and dropping empty tokens:
tokens = iter(filter(None, (i.strip() for i in resexp.split(s))))
And finally got:
$ python lispparse.py
[['Satellite',
['span 69 74'],
['rel2par Elaboration'],
['Nucleus',
['span 69 72'],
['rel2par span'],
['Nucleus', [...], [...], [...], [...]],
['Satellite', [...], [...], [...], [...]]],
['Satellite',
['span 73 74'],
['rel2par Elaboration'],
['Nucleus', [...], [...], [...]],
['Satellite', [...], [...], [...]]]]]
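For reference, here is a self-contained sketch along the same lines. It is not the code elided above but a variant under stated assumptions: _! is assumed to open and close a text span, which is matched as a single token and has its delimiters stripped, and ordinary words are split individually (so 'span 69 74' becomes three items, closer to the output asked for in the question). The names TOKEN_RE and parse_tree are hypothetical. The parser uses an explicit stack instead of recursion, because the recursive version makes one call per token and can hit Python's recursion limit on long files.

```python
import re
from pprint import pprint

# Assumption: "_!" ... "_!" wraps free text that may contain spaces, so the
# whole span is matched first; otherwise a token is a single parenthesis or
# a run of characters that are neither whitespace nor parentheses.
TOKEN_RE = re.compile(r'_!.*?_!|[()]|[^\s()]+', re.DOTALL)

def parse_tree(text):
    """Build nested lists iteratively, keeping a stack of open lists."""
    root = []
    stack = [root]
    for token in TOKEN_RE.findall(text):
        if token == '(':
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif token == ')':
            if len(stack) == 1:
                raise ValueError('Missing open paren')
            stack.pop()
        elif len(token) >= 4 and token.startswith('_!') and token.endswith('_!'):
            stack[-1].append(token[2:-2])  # drop the _! delimiters
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError('Missing close paren')
    return root

sample = '( Nucleus (leaf 73) (rel2par span) (text _!not a forecast_!) )'
pprint(parse_tree(sample))
# [['Nucleus', ['leaf', '73'], ['rel2par', 'span'], ['text', 'not a forecast']]]
```

If the grouping shown in the output above is preferred ('span 69 74' kept as one string), the split-based tokens from this answer can be fed to the same stack loop instead.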