Error: A token can only be part of one entity, so make sure the entities you're setting don't overlap

Posted 2024-04-23 07:17:42


I am trying to convert a spaCy NER dataset to Flair format with the following code:

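import spacy
# spaCy v2 API; in spaCy v3 the equivalent is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load("en_core_web_md")

ents = TRAIN_DATA  # list of (text, {"entities": [(start, end, label), ...]}) pairs

with open("flair_ner.txt", "w") as f:
    for sent, tags in ents:
        doc = nlp(sent)
        # Raises ValueError [E103] if any two entity spans overlap
        biluo = biluo_tags_from_offsets(doc, tags['entities'])
        for word, tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        f.write("\n")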

I ran into an overlap error:

ValueError: [E103] Trying to set conflicting doc.ents: '(1155, 1199, 'Email Address')' and '(1143, 1240, 'Links')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

Here is an example from the data:

[('Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Programming Languages: C, C++, Java, .net, php.\n• Web Designing: HTML, XML\n• Operating Systems: Windows […] Windows Server 2003, Linux.\n• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.\n\nhttps://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN',
  {'entities': [(1155, 1199, 'Email Address'),
    (1143, 1240, 'Links'),
    (743, 1141, 'Skills'),
    (729, 733, 'Graduation Year'),
    (706, 728, 'Location'),
    (675, 703, 'College Name'),
    (631, 673, 'Degree'),
    (625, 630, 'Graduation Year'),
    (614, 623, 'College Name'),
    (606, 612, 'Degree'),
    (458, 479, 'Location'),
    (438, 454, 'Companies worked at'),
    (104, 148, 'Email Address'),
    (62, 68, 'Location'),
    (0, 14, 'Name')]}),

1 answer
User
#1 · Posted 2024-04-23 07:17:42

prodigy/spacy support

"The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could have two sentences with the different annotations in your data. I'm not sure whether this would hurt or help your performance, though."
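One way to read the suggestion above, as a rough sketch rather than anything the spaCy team prescribes: split the conflicting annotations into non-overlapping layers, so that each layer can be paired with its own copy of the text. The helper name `split_into_layers` is hypothetical, and the sample offsets are taken from the question.

```python
def split_into_layers(entities):
    """Greedily distribute (start, end, label) spans into non-overlapping layers."""
    layers = []
    for span in sorted(entities):  # process spans in order of start offset
        for layer in layers:
            # Spans arrive in start order, so only the last span of a layer can collide.
            if layer[-1][1] <= span[0]:
                layer.append(span)
                break
        else:
            layers.append([span])  # no existing layer had room: open a new one
    return layers


# Three of the annotations from the question; the first two overlap.
entities = [
    (1155, 1199, "Email Address"),
    (1143, 1240, "Links"),
    (743, 1141, "Skills"),
]
layers = split_into_layers(entities)
for layer in layers:
    print(layer)
```

Each layer could then become a separate training example of the form `(sent, {"entities": layer})`. Note that the extra copies carry only part of the annotations, which is one reason this approach may hurt rather than help, as the quote cautions.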

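The error message shows that the Email Address span (start 1155, end 1199) lies inside the Links span (start 1143, end 1240), so the two overlap. The overlapping annotations need to be resolved before running the conversion code.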
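A minimal sketch of one way to resolve the overlaps, assuming it is acceptable to keep only the longest span in each conflicting group (here that would keep Links and drop the nested Email Address). The helper name `drop_overlapping` is hypothetical. spaCy also ships `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects rather than raw offsets.

```python
def drop_overlapping(entities):
    """Keep the longest span wherever (start, end, label) annotations overlap."""
    kept = []
    # Visit the longest spans first; ties go to the span that starts earlier.
    by_length = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    for start, end, label in by_length:
        # A span is accepted only if it is disjoint from every span kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Three of the annotations from the question; the first two overlap.
entities = [
    (1155, 1199, "Email Address"),
    (1143, 1240, "Links"),
    (743, 1141, "Skills"),
]
cleaned = drop_overlapping(entities)
print(cleaned)  # [(743, 1141, 'Skills'), (1143, 1240, 'Links')]
```

In the conversion loop, this would mean passing `drop_overlapping(tags['entities'])` to `biluo_tags_from_offsets` instead of the raw list. Whether the longest span is the right one to keep depends on the dataset, so this choice is an assumption, not a rule.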
