What is the fastest way to parse large XML documents in Python?

Posted 2024-04-29 09:37:59


I am currently running the following code, based on Chapter 12.5 of the Python Cookbook:

from xml.parsers import expat

class Element(object):
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self,key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)

class Xml2Obj(object):
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        element = Element(name.encode(), attributes)
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        self.nodeStack.pop()
    def CharacterData(self,data):
        if data.strip():
            data = data.encode()
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        Parser = expat.ParserCreate()
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
        ParserStatus = Parser.Parse(open(filename).read(),1)
        return self.root

I am working with XML documents that are about 1 GB in size. Does anyone know a faster way to parse them?


3 answers

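It looks to me as if you do not need any DOM capabilities in your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and handle the events as they occur.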

Note, however, Fredrik's advice on using the cElementTree iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

lxml.iterparse() does not allow this.

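The previous code does not work on Python 3.7, because iterators no longer have a .next() method there (the built-in next(context) is the replacement). Alternatively, consider the following way of getting hold of the first element.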

# get an iterable
context = iterparse(source, events=("start", "end"))

is_first = True

for event, elem in context:
    # get the root element
    if is_first:
        root = elem
        is_first = False
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
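
For anyone who wants to try the pattern above end to end, here is a minimal sketch. It assumes the standard-library xml.etree.ElementTree rather than cElementTree, and a document whose repeated elements are called record; the function name count_records and the sample data are made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse


def count_records(source, tag="record"):
    """Stream through `source`, counting `tag` elements without keeping them."""
    # Ask for "start" events too, so the very first event hands us the root.
    context = iterparse(source, events=("start", "end"))
    _, root = next(context)  # built-in next() instead of context.next()

    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1    # ... process the record here ...
            root.clear()  # drop the children accumulated under the root
    return count


# Hypothetical sample data, standing in for a file name or an open file.
sample = io.BytesIO(b"<root><record id='1'>a</record><record id='2'>b</record></root>")
print(count_records(sample))
```

On a real 1 GB file you would pass the file name (or a file opened in binary mode) instead of the BytesIO object. Clearing the root after each record is what keeps memory from growing with the number of records.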

I recommend lxml. It is a Python binding for the libxml2 library, and it is very fast.

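In my experience, libxml2 and expat have very similar performance. But I prefer libxml2 (and lxml for Python) because it seems to be more actively developed and tested. libxml2 also has more features.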

lxml's API is mostly compatible with xml.etree.ElementTree, and its website has good documentation.
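
To illustrate the API compatibility, here is a small sketch that runs unchanged on either library. It assumes lxml is installed (for example via pip) and falls back to the standard library otherwise; the helper record_ids and the sample XML are hypothetical, and only calls that exist in both libraries are used.

```python
try:
    from lxml import etree  # third-party; assumed to be installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback


def record_ids(xml_bytes):
    """Return the `id` attribute of every <record> directly under the root."""
    root = etree.fromstring(xml_bytes)
    return [rec.get("id") for rec in root.findall("record")]


print(record_ids(b"<root><record id='1'/><record id='2'/></root>"))
```

Note that this loads the whole document into memory, so it only shows that the same calls work in both libraries. It is not a way to handle a 1 GB file.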

Have you tried the cElementTree module?

cElementTree is included with Python 2.5 and later as xml.etree.cElementTree. Refer to the benchmarks.
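
One caveat for newer interpreters: xml.etree.cElementTree was deprecated in Python 3.3 (where xml.etree.ElementTree started using the C accelerator by itself) and removed in Python 3.9. A common way to write code that runs on both old and new versions is an import fallback, sketched below; the helper tag_counts and its sample input are hypothetical.

```python
import io

try:
    import xml.etree.cElementTree as ET  # Python 2.5 through 3.8
except ImportError:
    import xml.etree.ElementTree as ET   # Python 3.9+, where cElementTree is gone


def tag_counts(source):
    """Count how often each tag occurs, freeing each element once it is seen."""
    counts = {}
    for _, elem in ET.iterparse(source):  # default is "end" events only
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
        elem.clear()  # release the element's text and children
    return counts


print(tag_counts(io.BytesIO(b"<root><a/><a/><b/></root>")))
```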

