二进制解析树表示的字符串列表

2024-06-13 05:54:38 发布

您现在位置:Python中文网/ 问答频道 /正文

假设我有一个用括号表示的二叉树,如下所示:

best_parse = "(ROOT (S (S (S (S (S Our) (S intent)) (S is)) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))))) (S (S he) (S says))))"

我可以使用tree_to_spans方法从中获取跨度(成分),该方法提供了[(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

import nltk
import collections

def tree_to_spans(tree):
    if isinstance(tree, str):
        tree = nltk.Tree.fromstring(tree)
    length = len(tree.pos())
    queue = collections.deque(tree.treepositions())
    stack = [(queue.popleft(), 0)]
    j = 0
    spans = []
    while stack != []:
        (p, i) = stack[-1]
        if not queue or queue[0][:-1] != p:
            if isinstance(tree[p], nltk.tree.Tree):
                if j - i > 1:
                    spans.append((tree[p].label(), (i, j)))
            else:
                j = i + 1
            stack.pop()
        else:
            q = queue.popleft()
            stack.append((q, j))
    return spans

我还可以使用get_constituents方法获得对应于span的字符串,该方法提供了树的所有组成部分

['Our intent', 'Our intent is', 'the best', 'the best alternative', 'promote the best alternative', 'to promote the best alternative', 'Our intent is to promote the best alternative', 'he says', 'Our intent is to promote the best alternative he says']

def get_constituents(sample_string):
    t = nltk.Tree.fromstring(sample_string)
    spans = evaluate.tree_to_spans(t)
    sentence = " ".join(item[0] for item in t.pos()).split()
    constituents = [" ".join(sentence[span[0]: span[1]])for span in spans]
    # Add original sentence
    constituents = constituents + [" ".join(sentence)]
    return constituents

问题:给定所有成分的列表,我无法将其转换回树(如开头所示)。基本上,逆向工程。假设句子是:

s = "Our intent is to promote the best alternative he says"

编辑

假设可以从列表(零件)中删除某些元素。例如-

parts = [
    'Our intent', 
    'the best', 
    'the best alternative', 
    'promote the best alternative', 
    'to promote the best alternative', 
    'Our intent is to promote the best alternative', 
    'Our intent is to promote the best alternative he says'
]

这里,"He says""Our intent is"已从列表中删除。我希望得到与上面相同格式的树结构,除了删除的成分

另一种看待它的方式是:

P>考虑,我们有跨度-

spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

我删除了(0, 3)(8, 10)

我想用括号括起来,像这样:

(((0  1  2)  (3  (4  ((5  6  7)  8))))  9  10)

然后,我们可以用相应的字符串映射索引

编辑-2

例如,如果我们只从部件中删除"he says""Our intent is",则括号形式的最终树应该如下所示:

"(ROOT (S (S (S (S (S Our) (S intent)) (S is) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative)))))(S he) (S says))))")

另一个例子是,如果我们只从部件中删除"to promote the best alternative","Our intent is to promote the best alternative",则括号形式的最终树应该如下所示:

"(ROOT (S (S (S Our) (S intent)) (S is)) (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))) (S (S he) (S says)))"

我们可以假设完整的句子"Our intent is to promote the best alternative he says"永远不会被删除。这也适用于句子中的单个单词,只是为了给你一个背景


Tags: thetotreequeueisstackourbest
1条回答
网友
1楼 · 发布于 2024-06-13 05:54:38

我将以相反的顺序迭代这些成分,以便获得树的有序遍历(在遍历左侧之前遍历右侧)。因此,在这个解决方案中,我假设成分的顺序与从代码中获得它们的顺序相同

通过递归,您可以递归地重建每个子树:

def to_tree(parts):
    i = len(parts) - 1

    def recur(part, expect=False):
        nonlocal i
        words = part.split()
        if len(words) == 1:  # leaf
            return "(S {})".format(words[0])
        if expect and i > 0 and parts[i-1] == part:
            i -= 1
        if len(words) == 2:  # 2 leaves
            return "(S (S {}) (S {}))".format(*words)
        i -= 1
        nextpart = parts[i]
        if part.endswith(" " + nextpart):
            right = recur(nextpart)
            left = recur(part[0:-len(nextpart)-1], True)
        elif part.startswith(nextpart + " "):
            right = recur(part[len(nextpart)+1:], True)
            left = recur(nextpart)
        else: 
            sides = part.split(" " + nextpart + " ")
            assert len(sides) == 2, "Could not parse input" 
            right = recur(sides[1], True)
            left = recur(sides[0], True)
        return "(S {} {})".format(left, right)
            
    return "(ROOT {})".format(recur(parts[i]))

该示例可以按以下方式运行:

parts = [
    'Our intent', 
    'Our intent is', 
    'the best', 
    'the best alternative', 
    'promote the best alternative', 
    'to promote the best alternative', 
    'Our intent is to promote the best alternative', 
    'he says', 
    'Our intent is to promote the best alternative he says'
]

print(to_tree(parts))

…将输出原始字符串

根据您的编辑,上述代码可以在从输入中删除一些内容后“保留”下来。例如,“我们的意图是”和“他说”可以从输入中删除,而输出仍然相同。但也有局限性。如果删除的内容过多,则无法再重建树

相关问题 更多 >