Training an IOB chunker using the Brill trainer (transformation-based learning)

1 answer

User

#1 · Posted 2024-05-15 11:59:44

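The nltk3-brill-trainer-api (which I wrote) does handle training on token sequences described by multidimensional features, as in your data. However, the practical limits can be severe. The number of possible templates grows drastically in multidimensional learning, and the current nltk implementation of the Brill trainer trades memory for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformational rule sequences...". Memory consumption can be huge, and it is very easy to give the system more templates and/or data than it can handle. A useful strategy is to rank templates by how often they produce good rules (see print_template_statistics() in the example below). Usually you can discard the lowest-scoring fraction (say 50-90%) with little or no loss in performance and a large decrease in training time.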
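The ranking-and-pruning idea can be sketched independently of the trainer. The snippet below is only an illustration: it assumes you have already collected, for every learned rule, the id of the template that produced it and the rule's score. The `rule_records` format and both helper names are hypothetical and not part of nltk.

```python
import math
from collections import defaultdict


def rank_templates(rule_records):
    """Rank template ids by the total score of the rules they produced.

    rule_records: iterable of (template_id, rule_score) pairs. This is a
    hypothetical format; collect it however your trainer exposes its rules.
    """
    totals = defaultdict(int)
    for template_id, rule_score in rule_records:
        totals[template_id] += rule_score
    # Highest total first; ties are broken by template id for determinism.
    return sorted(totals, key=lambda tid: (-totals[tid], tid))


def keep_top_fraction(ranked_ids, fraction):
    """Keep the best `fraction` of the ranked templates (at least one)."""
    n_keep = max(1, math.ceil(len(ranked_ids) * fraction))
    return ranked_ids[:n_keep]


records = [("t1", 5), ("t2", 1), ("t1", 3), ("t3", 2)]
ranked = rank_templates(records)
survivors = keep_top_fraction(ranked, 0.5)
print(ranked, survivors)
```

Templates that never produced a rule do not show up in the records at all, so they are dropped implicitly. The surviving ids would then be used to filter the template list before retraining on the full data.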

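Another, or additional, possibility is to use the nltk implementation of Brill's original algorithm, which has very different memory-speed tradeoffs; it does no indexing and so uses much less memory. It uses some optimizations and is actually rather quick at finding the very best rules, but it is generally extremely slow towards the end of training, when there are many competing, low-scoring candidates. Sometimes you don't need those anyway. For some reason this implementation seems to have been omitted from newer versions of nltk, but here is the source (I just tested it): http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html.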
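To make the "no indexing" tradeoff concrete, here is a toy, index-free greedy loop in the spirit of transformation-based learning. It is not nltk's brill_trainer_orig; it is a simplified sketch with a single hypothetical template ("change tag A to B when the previous tag is C") and a single sentence, and every function name in it is made up for illustration. Each iteration rescans all the data to rescore every candidate, which keeps memory low but costs time.

```python
def candidate_rules(current, gold):
    """Propose a (from_tag, to_tag, prev_tag) rule at every wrong position."""
    rules = set()
    for i in range(1, len(gold)):
        if current[i] != gold[i]:
            rules.add((current[i], gold[i], current[i - 1]))
    return rules


def apply_rule(rule, tags):
    """Apply a rule everywhere it matches, deciding on the old tags only."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if i > 0 and tag == from_tag and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]


def net_score(rule, current, gold):
    """Errors fixed minus errors introduced, found by a full rescan."""
    proposed = apply_rule(rule, current)
    fixed = sum(c != g and p == g for c, p, g in zip(current, proposed, gold))
    broken = sum(c == g and p != g for c, p, g in zip(current, proposed, gold))
    return fixed - broken


def train(initial, gold, min_score=1):
    """Greedily pick the best-scoring rule until none reaches min_score."""
    current, learned = list(initial), []
    while True:
        scored = [(net_score(r, current, gold), r)
                  for r in candidate_rules(current, gold)]
        if not scored:
            break
        best_score, best_rule = max(scored)
        if best_score < min_score:
            break
        learned.append(best_rule)
        current = apply_rule(best_rule, current)
    return learned, current


gold = ["DT", "NN", "VB", "DT", "NN"]
initial = ["DT", "NN", "NN", "DT", "NN"]  # e.g. a default-tag baseline
learned, final = train(initial, gold)
print(learned, final)
```

With many templates the candidate set grows quickly, and late in training most candidates have low, nearly equal scores, which is where a rescanning loop like this spends most of its time.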

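There are other algorithms with other tradeoffs; in particular, the fast, memory-efficient indexing algorithm of Florian and Ngai 2000 (http://www.aclweb.org/anthology/N/N01/N01-1006.pdf) and the probabilistic rule sampling of Samuel 1998 (https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf) would be useful additions. Also, as you noticed, the documentation is incomplete and too focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is also on the todo list.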

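However, interest in generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 was left untouched for 10 years), so don't hold your breath. If you get impatient, you may want to look at alternatives, in particular mutbl and fntbl (google them; I only have enough reputation for two links).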

Anyway, here is a quick sketch for nltk:

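First, a hardcoded convention in nltk is that tagged sequences ("tags" meaning any label you want to assign to your data, not necessarily part of speech) are represented as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in many basic applications, so are the tokens. For instance, the tokens may be words and the strings their POS, as in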

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

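(As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and its documentation, but it would arguably be better expressed with named tuples than with pairs, so that instead of saying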

[token for (token, _tag) in tagged_sequence]

you could say, for instance,

[x.token for x in tagged_sequence]

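The first form fails on non-pairs, but the second exploits duck typing: the tagged sequence can be any sequence of user-defined instances, as long as they have an attribute "token".)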
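A small sketch of the difference, using collections.namedtuple from the standard library. The names `TaggedToken` and `RichToken` are made up for this illustration and are not nltk types.

```python
from collections import namedtuple

# A named pair: unpacks like a tuple, but also has .token and .tag.
TaggedToken = namedtuple("TaggedToken", ["token", "tag"])

tagged_sequence = [TaggedToken("And", "CC"), TaggedToken("now", "RB")]

# Pair-style access: works only for items that unpack into exactly two parts.
tokens_by_unpacking = [token for (token, _tag) in tagged_sequence]

# Attribute-style access: works for anything that has a .token attribute.
tokens_by_attribute = [x.token for x in tagged_sequence]


class RichToken:
    """A user-defined item that is not a pair but still has .token and .tag."""

    def __init__(self, token, tag, iob):
        self.token, self.tag, self.iob = token, tag, iob


rich_sequence = [RichToken("And", "CC", "O"), RichToken("now", "RB", "B-ADVP")]
tokens_from_rich = [x.token for x in rich_sequence]  # fine: duck typing

try:
    [token for (token, _tag) in rich_sequence]
    unpacking_worked = True
except TypeError:
    # RichToken instances are not iterable, so pair unpacking fails.
    unpacking_worked = False

print(tokens_by_unpacking, tokens_by_attribute, tokens_from_rich, unpacking_worked)
```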

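Now, there may well be cases where you have a richer representation of what a token is at your disposal. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects each token to be a featureset rather than a string: a dictionary that maps feature names to feature values for each item in the sequence.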

A tagged sequence may then look like

[({'word': 'Pierre', 'tag': 'NNP', 'iob': 'B-NP'}, 'NNP'),
 ({'word': 'Vinken', 'tag': 'NNP', 'iob': 'I-NP'}, 'NNP'),
 ({'word': ',',      'tag': ',',   'iob': 'O'   }, ','),
 ...
]

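There are other possibilities (though with less support in the rest of nltk). For instance, you could use a named tuple for each token, or a user-defined class that lets you add any amount of dynamic computation to attribute access (perhaps using @property to offer a consistent interface).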
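As an illustration of the user-defined-class option, here is a minimal sketch. The `Token` class, its derived attributes, and the `extract_suffix` helper are hypothetical; the helper only mirrors the (tokens, index) calling convention used by the feature classes in the example further down.

```python
class Token:
    """A token with stored attributes plus attributes computed on access."""

    def __init__(self, word, pos):
        self.word = word
        self.pos = pos

    @property
    def is_capitalized(self):
        # Computed each time it is read; looks like a plain attribute.
        return self.word[:1].isupper()

    @property
    def suffix(self):
        # Last three characters, lowercased (shorter words are kept whole).
        return self.word[-3:].lower()


def extract_suffix(tokens, index):
    """Read a computed attribute from the token half of a (token, tag) pair."""
    return tokens[index][0].suffix


tagged = [(Token("Pierre", "NNP"), "B-NP"), (Token(",", ","), "O")]
print(tagged[0][0].is_capitalized, extract_suffix(tagged, 0))
```

Stored and computed attributes are read the same way, so feature extractors do not need to know which is which.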

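The brill tagger doesn't need to know which view you currently provide on your tokens. However, it does require you to provide an initial tagger that can map sequences of tokens-in-your-representation to sequences of tags. You cannot use the existing taggers in nltk.tag.sequential directly, since they expect [(word, tag), ...]. But you may still be able to exploit them. The example below uses this strategy (in MyInitialTagger), together with the token-as-featureset-dictionary view.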

from __future__ import division, print_function, unicode_literals

import sys

from nltk import tbl, untag
from nltk.tag.brill_trainer import BrillTaggerTrainer
# or: 
# from nltk.tag.brill_trainer_orig import BrillTaggerTrainer
# 100 templates and a tiny 500 sentences (11700 
# tokens) produce 420000 rules and use a 
# whopping 1.3GB of memory on my system;
# brill_trainer_orig is much slower, but uses 0.43GB

from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags
from nltk.tag import DefaultTagger


def get_templates():
    wds10 = [[Word([0])],
             [Word([-1])],
             [Word([1])],
             [Word([-1]), Word([0])],
             [Word([0]), Word([1])],
             [Word([-1]), Word([1])],
             [Word([-2]), Word([-1])],
             [Word([1]), Word([2])],
             [Word([-1,-2,-3])],
             [Word([1,2,3])]]

    pos10 = [[POS([0])],
             [POS([-1])],
             [POS([1])],
             [POS([-1]), POS([0])],
             [POS([0]), POS([1])],
             [POS([-1]), POS([1])],
             [POS([-2]), POS([-1])],
             [POS([1]), POS([2])],
             [POS([-1, -2, -3])],
             [POS([1, 2, 3])]]

    iobs5 = [[IOB([0])],
             [IOB([-1]), IOB([0])],
             [IOB([0]), IOB([1])],
             [IOB([-2]), IOB([-1])],
             [IOB([1]), IOB([2])]]


    # the 5 * (10+10) = 100 3-feature templates 
    # of Ramshaw and Marcus
    templates = [tbl.Template(*wdspos+iob) 
        for wdspos in wds10+pos10 for iob in iobs5]
    # Footnote:
    # any template-generating functions in new code 
    # (as opposed to recreating templates from earlier
    # experiments like Ramshaw and Marcus) might 
    # also consider the mass generating Feature.expand()
    # and Template.expand(). See the docs, or for 
    # some examples the original pull request at
    # https://github.com/nltk/nltk/pull/549 
    # ("Feature- and Template-generating factory functions")

    return templates

def build_multifeature_corpus():
    # The true value of the target fields is unknown in testing, 
    # and, of course, templates must not refer to it in training.
    # But we may wish to keep it for reference (here, truepos).

    def tuple2dict_featureset(sent, tagnames=("word", "truepos", "iob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["truepos"]) for t in tokens]
    # conlltagged_sents :: [[(word,tag,iob)]]
    conlltagged_sents = (tree2conlltags(sent) 
        for sent in treebank_chunk.chunked_sents())
    conlltagged_tokenses = (tuple2dict_featureset(sent) 
        for sent in conlltagged_sents)
    conlltagged_sequences = (tag_tokens(sent) 
        for sent in conlltagged_tokenses)
    return conlltagged_sequences

class Word(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["word"]

class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["iob"]

class POS(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]


class MyInitialTagger(DefaultTagger):
    def choose_tag(self, tokens, index, history):
        tokens_ = [t["word"] for t in tokens]
        return super().choose_tag(tokens_, index, history)


def main(argv):
    templates = get_templates()
    trainon = 100

    corpus = list(build_multifeature_corpus())
    train, test = corpus[:trainon], corpus[trainon:]

    print(train[0], "\n")

    initial_tagger = MyInitialTagger('NN')
    print(initial_tagger.tag(untag(train[0])), "\n")

    trainer = BrillTaggerTrainer(initial_tagger, templates, trace=3)
    tagger = trainer.train(train)

    taggedtest = tagger.tag_sents([untag(t) for t in test])
    print(test[0])
    print(initial_tagger.tag(untag(test[0])))
    print(taggedtest[0])
    print()

    tagger.print_template_statistics()

if __name__ == '__main__':
    sys.exit(main(sys.argv))

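The setup above builds a POS tagger. If you instead want to target another attribute, say to build an IOB tagger, you need a couple of small changes so that the target attribute (which you can think of as read-write) is accessed from the "tag" position in your corpus [(token, tag), ...] and any other attributes (which you can think of as read-only) are accessed from the "token" position. For example: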

1) Construct your corpus [(token, tag), (token, tag), ...] for IOB tagging

def build_multifeature_corpus():
    ...

    def tuple2dict_featureset(sent, tagnames=("word", "pos", "trueiob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["trueiob"]) for t in tokens]
    ...

2) Change the initial tagger accordingly

...
initial_tagger = MyInitialTagger('O')
...

3) Modify the feature-extracting class definitions

class POS(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["pos"]

class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]
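The three changes share one idea: the attribute being learned sits in the "tag" slot, and everything else stays in the featureset. A minimal sketch of a corpus builder that makes the target a parameter follows; `make_tagged_sentence` and its `target` argument are hypothetical helper names, not part of nltk, and the input is assumed to be (word, pos, iob) triples in the order produced by tree2conlltags.

```python
def make_tagged_sentence(triples, target, names=("word", "pos", "iob")):
    """Turn [(word, pos, iob), ...] into [(featureset, target_value), ...].

    The gold value of the target is kept in the featureset under a
    "true"-prefixed key for reference only; templates must not read it.
    """
    if target not in names:
        raise ValueError("unknown target attribute: %r" % (target,))
    tagged = []
    for triple in triples:
        featureset = dict(zip(names, triple))
        gold = featureset.pop(target)          # remove the plain key
        featureset["true" + target] = gold     # keep gold value for reference
        tagged.append((featureset, gold))
    return tagged


sent = [("Pierre", "NNP", "B-NP"), (",", ",", "O")]
iob_view = make_tagged_sentence(sent, "iob")   # corpus for an IOB tagger
pos_view = make_tagged_sentence(sent, "pos")   # corpus for a POS tagger
print(iob_view[0])
print(pos_view[1])
```

With the "iob" view, a feature for the target reads tokens[index][1], while word and POS features read tokens[index][0]["word"] and tokens[index][0]["pos"], matching step 3.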
