Removing the B and I tags from NER output

Posted 2024-05-19 01:35:31


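I have news articles and I want to run NER on them with DeepPavlov. The entities use the BIO tagging scheme: "B" marks the beginning of an entity, "I" stands for "inside" and is used for every word of an entity except the first, and "O" means the token is not part of any entity. The NER code looks like this: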

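def listOfTuples(list1, list2):
    # Pair each token with its tag
    return list(zip(list1, list2))

# ner_model and split are defined elsewhere (not shown in the question)
ner_result = []
for x in split:
    for y in split[0]:
        news_ner = ner_model([str(y)])
        teks = news_ner[0][0]  # tokens of the first (only) input text
        tag = news_ner[1][0]   # BIO tags for those tokens
        ner_result.extend(listOfTuples(teks, tag))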

print([i for i in ner_result if i[1] != 'O'])

The result looks like this:

[('KOMPAScom', 'B-ORG'), ('Kompascom', 'I-ORG'), ('IFCN', 'B-ORG'), ('-', 'I-ORG'), ('International', 'I-ORG'), ('Fact', 'I-ORG'), ('-', 'I-ORG'), ('Checking', 'I-ORG'), ('Network', 'I-ORG'), ('Kompascom', 'B-ORG'), ('49', 'B-CARDINAL'), ('IFCN', 'B-ORG'), ('Kompascom', 'B-ORG'), ('Redaksi', 'B-ORG'), ('Kompascom', 'I-ORG'), ('Wisnu', 'B-PERSON'), ('Nugroho', 'I-PERSON'), ('Jakarta', 'B-GPE'), ('Rabu', 'B-DATE'), ('17', 'I-DATE'), ('/', 'I-DATE'), ('10', 'I-DATE'), ('/', 'I-DATE'), ('2018', 'I-DATE'), ('KOMPAScom', 'B-ORG'), ('Redaksi', 'B-ORG'), ('Kompascom', 'I-ORG'), ('Wisnu', 'B-PERSON'), ('Nugroho', 'I-PERSON'), ('Kompascom', 'B-ORG'), ('Bentara', 'I-ORG'), ('Budaya', 'I-ORG'), ('Jakarta', 'I-ORG'), ('Palmerah', 'I-ORG')]

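I want to remove the B and I prefixes and merge the text of the tokens tagged B and I into a single entity, so that the output looks like this: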

[('KOMPAScom Kompascom', 'ORG'), ('IFCN - International Fact - Checking Network', 'ORG'), ('Kompascom', 'ORG'), ('49', 'CARDINAL'), ('IFCN', 'ORG'), ('Kompascom', 'ORG'), ('Redaksi Kompascom', 'ORG'), ('Wisnu Nugroho', 'PERSON'), ('Jakarta', 'GPE'), ('Rabu 17/10/2018', 'DATE'), ('KOMPAScom', 'ORG'), ('Redaksi Kompascom', 'ORG'), ('Wisnu Nugroho', 'PERSON'), ('Kompascom Bentara Budaya Jakarta Palmerah', 'ORG')]

Do you have any ideas?


1 Answer
User
#1 · Posted 2024-05-19 01:35:31

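You can simply iterate over the tagged tokens and join the tokens that belong to the same entity. It is not particularly elegant, but it works. Something like this: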

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    # Buffer for tokens belonging to the most recent entity
    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag == "O":
            # Tokens outside any entity are skipped
            continue
        # If an entity span starts ...
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append(
                    (" ".join(current_entity_tokens), current_entity))

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            raise ValueError("Invalid tag order.")

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was any entity at all
    if current_entity is not None:
        collapsed_result.append(
            (" ".join(current_entity_tokens), current_entity))
    return collapsed_result
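
The function above skips "O" tokens without closing the open entity, which is fine for the already filtered list in the question but would merge "B-ORG, O, I-ORG" into one span on unfiltered output. A minimal sketch of a variant follows, under my own assumptions: the name collapse_bio is hypothetical, "O" closes the current entity, and a stray "I-" tag starts a new entity instead of raising an error. The "di" token in the sample is made up for illustration; the other tokens come from the question.

```python
def collapse_bio(tagged_tokens):
    """Merge BIO-tagged (token, tag) pairs into (text, label) pairs.

    Hypothetical variant, not part of DeepPavlov: "O" closes the open
    entity, and an "I-X" tag with no matching open entity starts a new one.
    """
    result = []
    tokens, label = [], None  # buffer for the entity currently being built

    for token, tag in tagged_tokens:
        prefix, _, name = tag.partition("-")  # "B-ORG" -> ("B", "-", "ORG")
        if prefix == "I" and name == label:
            # Continuation of the open entity
            tokens.append(token)
            continue
        # Any other tag closes the open entity first
        if label is not None:
            result.append((" ".join(tokens), label))
        if tag == "O":
            tokens, label = [], None
        else:
            # "B-X", or a stray "I-X" without a matching open entity
            tokens, label = [token], name

    # Flush the last entity, if there is one
    if label is not None:
        result.append((" ".join(tokens), label))
    return result


sample = [
    ("Redaksi", "B-ORG"), ("Kompascom", "I-ORG"),
    ("Wisnu", "B-PERSON"), ("Nugroho", "I-PERSON"),
    ("di", "O"),  # illustrative token, not from the question
    ("Jakarta", "B-GPE"),
    ("Rabu", "B-DATE"), ("17", "I-DATE"), ("/", "I-DATE"),
    ("10", "I-DATE"), ("/", "I-DATE"), ("2018", "I-DATE"),
]
print(collapse_bio(sample))
# [('Redaksi Kompascom', 'ORG'), ('Wisnu Nugroho', 'PERSON'),
#  ('Jakarta', 'GPE'), ('Rabu 17 / 10 / 2018', 'DATE')]
```

Note that joining with a single space gives 'Rabu 17 / 10 / 2018' rather than the 'Rabu 17/10/2018' shown in the question. Getting the compact date would need an extra detokenization step, which neither version attempts.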
