SpaCy custom NER model: error when training the dependency parser

Posted 2024-04-16 03:50:51


I am trying to build a custom NER model with spaCy. After building the model for the entities, I also need to train the model for the dependency parser. I tried to follow the example code provided on the spaCy website: https://spacy.io/usage/training#tagger-parser

The sample training data shown on the spaCy website is:

TRAIN_DATA = [
    (
        "They trade mortgage-backed securities.",
        {
            "heads": [1, 1, 4, 4, 5, 1, 1],
            "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
        },
    ),
]
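
For reference, each of these lists has one entry per token of the sentence as spaCy tokenizes it, so there are seven entries here. A minimal sketch for inspecting that tokenization (assuming spaCy is installed):

import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("They trade mortgage-backed securities.")
print([t.text for t in doc])
# expected: ['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.']
# 7 tokens, matching the 7 entries in "heads" and "deps" above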

In this sample code, the training data contains a key called "heads". I don't really understand what it is or what role it plays in the code.

I tried to run the model without the "heads" key in the training data. A sample of my training data is shown below:

TRAIN_PARSER = (
    'Mr Manjunath who is in-charge of the motor at their Goa location.',
    {
        'deps': ['compound', 'ROOT', 'nsubj', 'relcl', 'prep', 'punct', 'pobj',
                 'prep', 'det', 'pobj', 'prep', 'poss', 'compound', 'pobj', 'punct'],
    },
)

When I try to run the model without the heads, using the code given below:

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# training data
TRAIN_DATA = TRAIN_PARSER


@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model='model1', output_dir='model2', n_iter=74):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the parser to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "parser" not in nlp.pipe_names:
        parser = nlp.create_pipe("parser")
        nlp.add_pipe(parser, first=True)
    # otherwise, get it, so we can add labels to it
    else:
        parser = nlp.get_pipe("parser")

    # add labels to the parser
    for _, annotations in TRAIN_DATA:
        for dep in annotations.get('deps', []):
            parser.add_label(dep)

    # get names of other pipes to disable them during training
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])


main(model='model1', output_dir='model2', n_iter=74)

I get the following error:

IndexError: list index out of range

Can someone explain what exactly the problem is here and how I can fix it? Also, how do I generate the "heads" labels for my training data?


1 Answer

#1 · Posted 2024-04-16 03:50:51

The heads information is needed to identify what a token's immediate "parent" in the dependency tree is. For example, in

"I like London and Berlin.",
        {
            "heads": [1, 1, 1, 2, 2, 1],
            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
        },

the head of the word I is at index 1, i.e. the word like, and it is attached to it with the dependency relation nsubj.
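
To make the indexing concrete, here is a small sketch (not part of the original answer) that simply reads each token's head off these two lists:

words = ["I", "like", "London", "and", "Berlin", "."]
heads = [1, 1, 1, 2, 2, 1]
deps = ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"]

for i, word in enumerate(words):
    # heads[i] is the index of the parent token; deps[i] is the relation to it
    print(f"{word} --{deps[i]}--> {words[heads[i]]}")
# I --nsubj--> like
# like --ROOT--> like   (the root is listed as its own head in this format)
# London --dobj--> like
# and --cc--> London
# Berlin --conj--> London
# . --punct--> like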

For more information on this terminology, see the spaCy documentation: https://spacy.io/usage/linguistic-features#navigating
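
As for producing "heads" and "deps" for your own training sentences, one common approach (not shown in the answer above, and assuming a pretrained English pipeline such as en_core_web_sm is installed) is to run an existing parser over each sentence and then hand-correct its output:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr Manjunath who is in-charge of the motor at their Goa location.")

# one entry per token: the index of the token's head and its dependency label
heads = [token.head.i for token in doc]
deps = [token.dep_ for token in doc]

print(list(zip([t.text for t in doc], heads, deps)))
# review and correct these annotations before using them as training data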
