Efficient XML parsing with ElementTree (1.3.0)

Asked on 2025-04-17 03:02

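I am trying to parse very large XML files, between 20MB and 3GB in size. The files are samples from different instruments. My task is to find the relevant element information in each file and insert it into a database (Django).

Below is a small part of one sample file. All files use namespaces. One notable feature is that the nodes carry most of their information in attributes rather than in text.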

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">

    <instrumentConfiguration id="QTOF">
                    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
                    <componentList count="4">
                            <source order="1">
                                    <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
                            </source>
                            <analyzer order="2">
                                    <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
                            </analyzer>
                            <analyzer order="3">
                                    <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
                            </analyzer>
                            <detector order="4">
                                    <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
                            </detector>
                    </componentList>
     </instrumentConfiguration>

A small but complete file can be found here.

So far, I have been calling findall for every element I am interested in.

import xml.etree.ElementTree as ET

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of every instrumentConfiguration element
    print(ins.attrib["id"])

How can I access all the children and grandchildren of the instrumentConfiguration element(s) in s?

s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')

An example of the output I want:

InstrumentConfiguration
-----------------------
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector

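Is there a more efficient way to parse elements, children, and grandchildren when namespaces are present? Or do I have to call find/findall every time I want a specific namespaced element in the tree? This is only a small example; I also need to parse more complex element hierarchies.

Any suggestions are welcome!

Edit

I have not received a correct answer yet, so I am editing the question again!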


5 Answers

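In this case I would use findall to get all the instrumentList elements, and then work on the results directly. instrumentList and instrument behave like arrays, so you can get at all of their child elements without searching for each one individually.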
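A minimal sketch of that idea in Python 3, using the element names from the question's sample rather than instrumentList. It assumes the document fits in memory, so it suits the 20MB files better than the 3GB ones. The prefix "mz", the helper names local_name and describe, and the inline SAMPLE (a copy of the question's excerpt with the root element closed) are made up for this illustration. Passing a prefix map to find/findall avoids repeating the full namespace URI in every path.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; "mz" is an arbitrary prefix chosen for this sketch.
NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Copy of the question's excerpt, closed so that it is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def local_name(tag):
    """Strip the '{namespace}' part from a qualified tag name."""
    return tag.rpartition("}")[2]


def describe(conf):
    """Return (key, value) rows for one instrumentConfiguration element."""
    rows = [("Id", conf.get("id"))]
    # Direct cvParam children of the configuration itself.
    for n, param in enumerate(conf.findall("mz:cvParam", NS), start=1):
        rows.append(("Parameter%d" % n, param.get("name")))
    # Iterating an element yields its children, whatever their tags are.
    component_list = conf.find("mz:componentList", NS)
    if component_list is not None:
        for component in component_list:
            for param in component.findall("mz:cvParam", NS):
                rows.append((local_name(component.tag), param.get("name")))
    return rows


root = ET.fromstring(SAMPLE)
for conf in root.findall(".//mz:instrumentConfiguration", NS):
    for key, value in describe(conf):
        print("%s: %s" % (key, value))
```

Only the top-level lookup needs a search path; below that, plain iteration over each element reaches the children and grandchildren.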

Answered on 2025-04-17 by Python大师


If this is still a problem, you could try pymzML, a Python interface for working with mzML files. Its website is http://pymzml.github.com/

Answered on 2025-04-17 by Python大师


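Here is a script that parses one million <instrumentConfiguration/> elements (a 967MB file) in 40 seconds without consuming a large amount of memory.

That is a throughput of 24MB/s. For comparison, the cElementTree page (2005) reports 47MB/s.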

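#!/usr/bin/env python
import sys
from itertools import islice
from operator import itemgetter

try:  # Python 2, as in the original answer
    from itertools import imap, izip
    from xml.etree import cElementTree as etree
except ImportError:  # Python 3: map/zip are already lazy, cElementTree is gone
    imap, izip = map, zip
    from xml.etree import ElementTree as etree

def parsexml(filename):
    # Keep only the element from each (event, element) pair.
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it)  # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))]  # cvParam
            componentList_count = int(next(it).get('count'))
            # Each component (source/analyzer/detector) is followed by its cvParam.
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2]
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear()  # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

if __name__ == '__main__':
    # The original left `filename` undefined; the command-line argument and
    # the fallback to the question's sample file name are assumptions.
    filename = sys.argv[1] if len(sys.argv) > 1 else 'plgs_example.mzML'
    print_values(parsexml(filename))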

Output

$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps

Note: this code is fragile. It assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that every value is available as a tag name or an attribute.
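If the child order cannot be relied on, one possible variation is to wait for the "end" event of each <instrumentConfiguration/> and then look its children up by tag instead of by position. The sketch below is Python 3 and is not the script that produced the timings in this answer; iter_configurations and the trimmed SAMPLE are illustrative names and data, and the speed and memory behaviour of this variant are untested here.

```python
import io
import xml.etree.ElementTree as ET

MZML = "{http://psi.hupo.org/ms/mzml}"

# Trimmed copy of the question's excerpt, used only to exercise the sketch.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="2">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <detector order="2">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def iter_configurations(source):
    """Yield a list of (key, value) rows per <instrumentConfiguration>."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != MZML + "instrumentConfiguration":
            continue
        # At the "end" event the whole subtree has been built, so children
        # can be looked up by tag regardless of their order.
        rows = [("Id", elem.get("id"))]
        for n, param in enumerate(elem.findall(MZML + "cvParam"), start=1):
            rows.append(("Parameter%d" % n, param.get("name")))
        for component in elem.findall(MZML + "componentList/*"):
            for param in component.findall(MZML + "cvParam"):
                rows.append((component.tag.rpartition("}")[2],
                             param.get("name")))
        yield rows
        elem.clear()  # drop the processed subtree
        root.clear()  # and detach finished top-level elements, as above


for conf in iter_configurations(io.BytesIO(SAMPLE)):
    for key, value in conf:
        print("%s: %s" % (key, value))
```

iterparse accepts a file name as well as a file object, so a path to an mzML file can be passed in place of the BytesIO wrapper.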

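On performance

In this case, ElementTree 1.3 is about six times slower than cElementTree 1.0.6.

If you replace root.clear() with elem.clear(), the code runs about 10% faster, but memory usage grows about tenfold. lxml.etree works with the elem.clear() variant and performs the same as cElementTree, but it uses about 20 times the memory of the root.clear() variant and about 2 times that of the elem.clear() variant (roughly 500MB).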

Answered on 2025-04-17 by Python大师
