Custom lxml TreeBuilder with mixed markup and text

Posted 2024-06-16 17:13:26


I want to use the lxml library to convert XML to a dict, but my XML is semi-structured (elements mixed with text), for example: <doc>This is<i>xmlobj</i> documentation.<b> SUPER</b> great</doc>

If I do this:

from collections import defaultdict

my_file = "aaa.xml"


def etree_to_dict(t):
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        dd = defaultdict(list)
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                dd[k].append(v)
        d = {t.tag: {k: v[0] if len(v) == 1 else v
                     for k, v in dd.items()}}
    if t.attrib:
        d[t.tag].update(('@' + k, v)
                        for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
              d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d

from lxml import etree as ET

count = 0
first = True

tree = ET.parse(my_file)
root = tree.getroot()
# entries = root.iter('PubmedArticle')
entries = root.iter('example')

for entry in entries:
    count += 1
    if first:
        # Only convert and print the first matching element.
        first = False
        r = etree_to_dict(entry)
        print(r)

print(count)

I get:

{'example': {'doc': {'i': 'xmlobj', 'b': 'SUPER', '#text': 'This is'}, 'a': 'Oui'}}

The text " great" is missing (and so is " documentation.", the text that follows </i>).
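The text is probably not dropped by the parser. In the ElementTree data model, which lxml shares, text that follows a child's closing tag is stored on that child as `tail`, and `etree_to_dict` only reads `t.text`. The sketch below shows where each piece ends up. It uses the standard-library `xml.etree.ElementTree` so it runs without lxml installed, and `mixed_content` is a hypothetical helper name, not part of either library.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>This is<i>xmlobj</i> documentation.<b> SUPER</b> great</doc>"
)


def mixed_content(elem):
    """Hypothetical helper: list the element's text, children and tails in document order."""
    parts = []
    if elem.text:
        # Text before the first child lives on the parent.
        parts.append(("#text", elem.text))
    for child in elem:
        parts.append((child.tag, child.text))
        if child.tail:
            # Text after the child's closing tag lives on the child itself.
            parts.append(("#text", child.tail))
    return parts


parts = mixed_content(doc)
print(parts)
```

A dict converter that also wants the trailing text would need to read `child.tail` in a similar way, at the cost of no longer being able to key children by tag alone if document order matters.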

I tried to implement my own tree builder:

from lxml import etree as ET


parser = ET.XMLParser( target = myTreeBuilder())
tree = ET.parse(my_file,parser)


r = etree_to_dict(tree)

But I don't understand how I should implement it.

The default TreeBuilder is here: https://github.com/python/cpython/blob/master/Lib/xml/etree/ElementTree.py
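As a rough starting point, a parser target is just an object with `start`, `end`, `data` and `close` methods, so it can build any structure it likes instead of an element tree. The sketch below is one possible shape, not the default TreeBuilder's behavior: `DictTarget` and its `tag`/`attrib`/`content` layout are made up for illustration. It is written against the standard-library `xml.etree.ElementTree.XMLParser` so it runs without lxml; lxml documents a similar target interface, but that is not exercised here.

```python
import xml.etree.ElementTree as ET


class DictTarget:
    """Hypothetical parser target that keeps text and child nodes in document order."""

    def __init__(self):
        # A synthetic top-level node so the root element has a parent to attach to.
        self.stack = [{"tag": None, "attrib": {}, "content": []}]

    def start(self, tag, attrib):
        node = {"tag": tag, "attrib": dict(attrib), "content": []}
        self.stack[-1]["content"].append(node)
        self.stack.append(node)

    def end(self, tag):
        self.stack.pop()

    def data(self, text):
        content = self.stack[-1]["content"]
        # The parser may deliver one run of text in several chunks; merge them.
        if content and isinstance(content[-1], str):
            content[-1] += text
        else:
            content.append(text)

    def close(self):
        # Return the first element node, i.e. the document root.
        return next(n for n in self.stack[0]["content"] if isinstance(n, dict))


parser = ET.XMLParser(target=DictTarget())
parser.feed("<doc>This is<i>xmlobj</i> documentation.<b> SUPER</b> great</doc>")
result = parser.close()  # returns whatever the target's close() returns
print(result)
```

Storing `content` as an ordered list rather than a dict keyed by tag is a deliberate choice here: it keeps " documentation." and " great" in their original positions, which a tag-keyed dict cannot express.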

