What is the fastest way to parse large XML documents in Python?

from xml.parsers import expat


class Element(object):
    """A minimal tree node: name, attributes, character data and children."""

    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []

    def addChild(self, element):
        self.children.append(element)

    def getAttribute(self, key):
        return self.attributes.get(key)

    def getData(self):
        return self.cdata

    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)


class Xml2Obj(object):
    """Builds a tree of Element objects from expat callbacks."""

    def __init__(self):
        self.root = None
        self.nodeStack = []

    def StartElement(self, name, attributes):
        # expat already passes str on Python 3, so no .encode() is needed
        element = Element(name, attributes)
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)

    def EndElement(self, name):
        self.nodeStack.pop()

    def CharacterData(self, data):
        # skip whitespace-only chunks between elements
        if data.strip():
            element = self.nodeStack[-1]
            element.cdata += data

    def Parse(self, filename):
        parser = expat.ParserCreate()
        parser.StartElementHandler = self.StartElement
        parser.EndElementHandler = self.EndElement
        parser.CharacterDataHandler = self.CharacterData
        # read bytes so expat can honour the encoding declared in the file
        with open(filename, 'rb') as f:
            parser.ParseFile(f)
        return self.root

3 Answers

User

Answer 1 · Edited 2024-05-15 13:24:00

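It looks to me as if your program does not need any DOM capabilities. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and handle the events as they occur.

Note, however, Fredrik's advice on using cElementTree's iterparse function: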

to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

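lxml.iterparse() does not allow this.

The previous code does not work on Python 3.7, because Python 3 iterators have no .next() method. Consider the following way of getting the first element instead.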

# get an iterable
context = iterparse(source, events=("start", "end"))

is_first = True

for event, elem in context:
    # get the root element
    if is_first:
        root = elem
        is_first = False
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
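
For reference, here is a minimal Python 3 sketch of the same root-clearing pattern using only the standard library. The `iter_records` helper, the `record` tag and the sample document are assumptions made for illustration, not part of Fredrik's original text.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield (attributes, text) for each <tag> element, freeing memory as we go."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is the "start" of the root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy what we need before the element is discarded.
            yield dict(elem.attrib), (elem.text or "").strip()
            # Drop processed children from the root so the tree stays small.
            root.clear()


# Hypothetical sample data; a real caller would pass a filename or open file.
sample = b"<log><record id='1'>first</record><record id='2'>second</record></log>"
records = list(iter_records(io.BytesIO(sample)))
```

Note that `root.clear()` also wipes the root element's own attributes and text, so read those before the loop if you need them.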

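User

Answer 2 · Edited 2024-05-15 13:24:00

I recommend lxml. It is a Python binding for the libxml2 library, and it is very fast.

In my experience, libxml2 and expat perform very similarly. I still prefer libxml2 (and lxml for Python) because it seems to be more actively developed and tested. libxml2 also has more features.

lxml is mostly API-compatible with xml.etree.ElementTree, and its website has good documentation.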
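
Because the two APIs largely overlap, streaming code can often be written once and run on either library. The sketch below assumes lxml may or may not be installed and falls back to the standard library; `count_tags` and the sample document are made up for illustration.

```python
import io

try:
    from lxml import etree  # third-party; assumed installed with `pip install lxml`
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback


def count_tags(source, tag):
    """Stream through source and count the elements named tag."""
    count = 0
    for _, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # release the element's text and children
    return count


# Hypothetical sample data; a filename or a binary file object also works.
sample = b"<log><record/><record/><other/></log>"
record_count = count_tags(io.BytesIO(sample), "record")
```

Only calls shared by both libraries are used here (`iterparse`, `elem.tag`, `elem.clear()`); lxml-specific extras are deliberately left out.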

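User

Answer 3 · Edited 2024-05-15 13:24:00

Have you tried the cElementTree module?

cElementTree is included with Python 2.5 and later as xml.etree.cElementTree. On Python 3.3 and later, xml.etree.ElementTree uses the C accelerator automatically, and xml.etree.cElementTree was removed in Python 3.9. Refer to the benchmarks.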
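
If the published benchmarks are unavailable, a rough comparison is easy to run yourself. The following is only a sketch on synthetic data: `make_document`, the record layout and the document size are assumptions, and timings will vary by machine and Python version.

```python
import time
import xml.etree.ElementTree as ET
from xml.parsers import expat


def make_document(n):
    """Build an in-memory XML document with n <record> elements (synthetic data)."""
    body = "".join('<record id="%d">value %d</record>' % (i, i) for i in range(n))
    return ("<log>" + body + "</log>").encode("utf-8")


def count_with_elementtree(data):
    """Parse the whole document into a tree, then count <record> elements."""
    return sum(1 for _ in ET.fromstring(data).iter("record"))


def count_with_expat(data):
    """Count <record> start tags with a bare expat handler, building no tree."""
    counter = {"n": 0}

    def start(name, attrs):
        if name == "record":
            counter["n"] += 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(data, True)
    return counter["n"]


data = make_document(10000)
results = {}
for label, func in [("ElementTree", count_with_elementtree), ("expat", count_with_expat)]:
    started = time.perf_counter()
    results[label] = func(data)
    print("%-12s %.4f s" % (label, time.perf_counter() - started))
```

Both functions should report the same count; only the elapsed time differs.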

