Converting an NLTK phrase structure tree to the BRAT .ann standoff format

3 votes

1 answer

1801 views

Asked 2025-04-18 03:12

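I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard as far as part-of-speech annotation goes, but differs when it comes to phrases/chunks.

Accordingly, I've POS-tagged my data with the NLTK defaults and built a regex chunker with nltk.RegexpParser. Basically, the output is now an NLTK-style phrase structure tree: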

Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])

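However, there are some things I'd like to annotate by hand on top of this: systemic grammar breaks participants and verbal groups down into subtypes that probably can't be tagged automatically. So I'd like to convert the parse tree into a format that an annotation tool (preferably BRAT) can process, then go through the texts and specify the subtypes manually, as in this possible solution:

[Image: BRAT annotation]

Perhaps the solution would be to get BRAT to treat the phrase structure as dependencies? I could modify the chunking regex if need be. Are there any converters out there? (BRAT provides ways of converting from CoNLL-2000 and Stanford CoreNLP, so if I could get the phrase structure into either of those forms, that would be fine too.)

Thanks!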

regex nlp nltk pos-tagging parsing phrase-structure-tree brat annotation-tool

1 Answer

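Representing a non-binary tree with arcs would be difficult, but we can nest "entity" annotations and use that to build a constituency parse structure. Note that I don't create nodes for the terminals of the tree (the part-of-speech tags), partly because Brat currently isn't good at displaying the unary rules that often apply to terminals. A description of the target format can be found here.

First, we need a function to produce standoff annotations. Brat expects standoff offsets in characters, but in what follows we use token offsets only and convert them to characters afterwards.

(Note: this uses NLTK 3.0b and Python 3.)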

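def _standoff(path, leaves, slices, offset, tree):
    """Collect leaves and subtree slices; return the subtree's width in tokens."""
    width = 0
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            # Terminal: a (token, tag) pair from the POS tagger
            tok, tag = child
            leaves.append(tok)
            width += 1
        else:
            # Non-terminal: recurse, tracking the child-index path from the root
            path.append(i)
            width += _standoff(path, leaves, slices, offset + width, child)
            path.pop()
    # Emitted after the children, so slices come out in post-order
    slices.append((tuple(path), tree.label(), offset, offset + width))
    return width


def standoff(tree):
    """Return (leaves, slices) for a tree, with slices in token offsets."""
    leaves = []
    slices = []
    _standoff([], leaves, slices, 0, tree)
    return leaves, slices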

Applying this to your example:

>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
>>> standoff(tree)
(['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
 [((0, 0, 0), 'Participant', 0, 1),
  ((0, 0, 1), 'Verbal-group', 1, 2),
  ((0, 0, 2), 'Participant', 2, 4),
  ((0, 0, 3), 'Circumstance', 4, 7),
  ((0, 0), 'Process-dependencies', 0, 7),
  ((0,), 'Clause', 0, 7),
  ((), 'S', 0, 8)])

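This returns the leaf tokens, followed by a list of tuples, one per subtree, with the elements (path of child indices from the root, label, start leaf, stop leaf). The stop index is exclusive.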
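Note that the slices come out in post-order, children before parents. If parent-first order is preferable (say, so that the IDs in the output file read top-down), sorting on the path tuple should be enough, since Python compares tuples lexicographically and a node's path is always a prefix of its descendants' paths. A minimal sketch using the slices from the example above; `preorder` is only an illustrative helper name, not part of NLTK or Brat:

```python
def preorder(slices):
    """Return slices sorted parent-first (pre-order) by their path tuple."""
    # Under tuple comparison: () < (0,) < (0, 0) < (0, 0, 0) < (0, 0, 1) ...
    return sorted(slices, key=lambda s: s[0])


slices = [
    ((0, 0, 0), 'Participant', 0, 1),
    ((0, 0, 1), 'Verbal-group', 1, 2),
    ((0, 0, 2), 'Participant', 2, 4),
    ((0, 0, 3), 'Circumstance', 4, 7),
    ((0, 0), 'Process-dependencies', 0, 7),
    ((0,), 'Clause', 0, 7),
    ((), 'S', 0, 8),
]

# Indent each label by its depth to make the nesting visible
for path, label, start, stop in preorder(slices):
    print('  ' * len(path) + label, start, stop)
```

The depth of each node is simply `len(path)`, which is also handy if you want to filter out the outermost levels before annotating.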

To convert this to character standoff:

def char_standoff(tree):
    leaves, tok_standoff = standoff(tree)
    text = ' '.join(leaves)
    # Map leaf index to its start and end character
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1
    starts.append(offset)
    return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                  for path, label, start_tok, end_tok in tok_standoff]

Then:

>>> char_standoff(tree)
('This is a representation of the grammar .',
 [((0, 0, 0), 'Participant', 0, 4),
  ((0, 0, 1), 'Verbal-group', 5, 7),
  ((0, 0, 2), 'Participant', 8, 24),
  ((0, 0, 3), 'Circumstance', 25, 39),
  ((0, 0), 'Process-dependencies', 0, 39),
  ((0,), 'Clause', 0, 39),
  ((), 'S', 0, 41)])
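One caveat: char_standoff rebuilds the text by joining the tokens with single spaces, so the offsets index into that rebuilt string (note the space before the final period) rather than into the original sentence. If the annotations need to line up with the raw text, one option is to locate each token in the raw string instead. The following is a sketch under the assumption that every token occurs verbatim and in order in the raw text, which may not hold if the tokenizer rewrites characters such as quotes; `align_tokens` and `raw_char_standoff` are hypothetical helper names:

```python
def align_tokens(raw_text, tokens):
    """Find (start, end) character offsets of each token in raw_text.

    Assumes every token occurs verbatim in raw_text, in order.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        start = raw_text.find(tok, cursor)
        if start < 0:
            raise ValueError(
                'token {!r} not found after offset {}'.format(tok, cursor))
        end = start + len(tok)
        spans.append((start, end))
        cursor = end  # continue searching after this token
    return spans


def raw_char_standoff(raw_text, tokens, tok_standoff):
    """Convert token-offset slices into character offsets over raw_text."""
    spans = align_tokens(raw_text, tokens)
    # A slice covers tokens start_tok .. end_tok - 1 (end is exclusive)
    return [(path, label, spans[start_tok][0], spans[end_tok - 1][1])
            for path, label, start_tok, end_tok in tok_standoff]


raw = 'This is a representation of the grammar.'
tokens = ['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.']
tok_standoff = [((0, 0, 3), 'Circumstance', 4, 7), ((), 'S', 0, 8)]
print(raw_char_standoff(raw, tokens, tok_standoff))
```

With this approach the .txt file would hold the raw text unchanged, and only the offsets are computed from the tokens.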

Finally, we can write a function that converts this to Brat's format:

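def write_brat(tree, filename_prefix):
    """Write filename_prefix.txt and filename_prefix.ann for one tree."""
    # Named "annotations" to avoid shadowing the standoff() function above
    text, annotations = char_standoff(tree)
    # Explicit encoding so non-ASCII text is written the same on every platform
    with open(filename_prefix + '.txt', 'w', encoding='utf-8') as f:
        print(text, file=f)
    with open(filename_prefix + '.ann', 'w', encoding='utf-8') as f:
        for i, (path, label, start, stop) in enumerate(annotations):
            # One line per annotation: ID, "label start stop", covered text
            print('T{}'.format(i), '{} {} {}'.format(label, start, stop),
                  text[start:stop], sep='\t', file=f)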

Calling write_brat(tree, '/path/to/something') writes the following to /path/to/something.txt:

This is a representation of the grammar .

and this to /path/to/something.ann:

T0	Participant 0 4	This
T1	Verbal-group 5 7	is
T2	Participant 8 24	a representation
T3	Circumstance 25 39	of the grammar
T4	Process-dependencies 0 39	This is a representation of the grammar
T5	Clause 0 39	This is a representation of the grammar
T6	S 0 41	This is a representation of the grammar .
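Before loading the files into Brat, it may be worth reading the .ann lines back and checking that every pair of offsets really slices the stated text out of the .txt content. The sketch below assumes the three tab-separated fields written by write_brat and contiguous spans only (one start/stop pair per annotation); `check_ann` is a hypothetical helper, not a Brat utility:

```python
def check_ann(text, ann_lines):
    """Yield (id, label, start, stop) per text-bound line, verifying offsets.

    Assumes contiguous spans only (one "start stop" pair per annotation).
    """
    for line in ann_lines:
        line = line.rstrip('\n')
        if not line.startswith('T'):
            continue  # skip anything that is not a text-bound annotation
        ann_id, type_and_span, covered = line.split('\t', 2)
        label, start, stop = type_and_span.split(' ')
        start, stop = int(start), int(stop)
        if text[start:stop] != covered:
            raise ValueError('{}: offsets {}-{} give {!r}, expected {!r}'.format(
                ann_id, start, stop, text[start:stop], covered))
        yield ann_id, label, start, stop


text = 'This is a representation of the grammar .'
ann = [
    'T0\tParticipant 0 4\tThis',
    'T3\tCircumstance 25 39\tof the grammar',
    'T6\tS 0 41\tThis is a representation of the grammar .',
]
for row in check_ann(text, ann):
    print(row)
```

In practice you would pass the contents of the .txt file (minus its trailing newline, which does not affect the offsets) and the open .ann file object.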

Answered 2025-04-18 by Python大师
