Generating spaCy training data from XML

Question

I have some data in XML format that looks like this:

<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>

I want to use these texts as training data for spaCy, so I need to convert them into the format spaCy requires:

doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": [(0, 5, "PERSON"), (14, 28, "LOC")]}
example = Example.from_dict(doc, gold_dict)

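In particular, I'm still unclear about how to create the offsets, i.e. the start and end positions of the entities. Is there a particularly suitable way to handle this?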
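For reference, the offsets in the spaCy snippet above are character positions into the raw text, with the end position exclusive, so each pair can be checked by slicing the string. A minimal sketch using only the sentence and offsets already quoted (no spaCy needed to run it):

```python
# The sentence and offsets from the spaCy example quoted above.
text = "Laura flew to Silicon Valley."
entities = [(0, 5, "PERSON"), (14, 28, "LOC")]

# Each (start, end, label) tuple is a plain Python slice of the text:
# start is inclusive, end is exclusive.
spans = [text[start:end] for start, end, _ in entities]

for (start, end, label), span in zip(entities, spans):
    print(label, (start, end), repr(span))
```

So whatever produces the offsets from the XML only has to keep a running character count of the plain text seen so far.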

Edit: this is what I've tried so far with ElementTree:

from xml.etree import ElementTree as ET

data = '''
<root>
<item n="main"><anchor type="b" ana="regO.lemID_12" xml:id="TidB13" />Stuttgart<anchor type="e" ana="reg0.lemID_12" xml:id="TidE13" /> d. 20. Sept [19]97<lb/>Lieber Herr Schmidt!<lb/>Ich bin sehr glücklich über die Aufnahme <anchor type="b" ana="regW.lemID_17" xml:id="TidB22" />meines <anchor type="b" ana="regP.lemID_4" xml:id="TidB4" />Shakespeare<anchor type="e" ana="regP.lemID_4" xml:id="TidE4" /><anchor type="e" ana="regW.lemID_17" xml:id="TidE22" /> bei euch, vielen Dank.</item>
</root>
'''
def get_entity_type(ana):
    if 'regO' in ana:
        return 'PLACE'
    if 'regP' in ana:
        return 'PERSON'
    if 'regW' in ana:
        return 'WORK'
    # NOTE: unreachable as written, because 'regP' is already matched above
    if 'regP' in ana:
        return "PERIODICA"
 
root = ET.fromstring(data)
print(root)
#text = ""
entities = []
current_pos = 0

for node in root.iter():
    #print(node)
    if node.tag == "anchor" and node.get('type')=='b':
        start_pos = current_pos
        ana = node.get('ana')
        entity_type = get_entity_type(ana)
        #print(entity_type)
    elif node.tag == "anchor" and node.get('type')=='e':
        entities.append((entity_type, start_pos, current_pos))       
                    
#print (entities)

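So grabbing the entity type works, but my idea for capturing the start and end positions of the entities is wrong. I also tried doing it with pawpaw, as described here, but it never finds "Ito".

This is what I tried with pawpaw: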

from pawpaw import ito
root = ET.fromstring(data)
elements = root.findall('.//')
print(elements)

for e in elements:
    plain_text = e.Ito.find('*[d:text]')
#     print(plain_text)
