用Python XML.sax解析XML：如何“跟踪”您在树中的位置？

import xml.sax class Exact(xml.sax.handler.ContentHandler): def __init__(self): self.curpath = [] def startElement(self, name, attrs): self.curpath.append(name) if name == 'Phone': print self.curpath, name def endElement(self, name): self.curpath.pop() if __name__ == '__main__': parser = xml.sax.make_parser() handler = Exact() parser.setContentHandler(handler) parser.parse(open('/home/cronuser/xml/mount/daily/debtors.xml'))

3条回答

网友

1楼 · 编辑于 2024-05-19 14:27:29

您需要使用SAX的具体原因是什么？

因为，如果将整个XML文件加载到内存中的对象模型中是可以接受的，那么您可能会发现使用ElementTree DOM API要容易得多。

（如果在给定子节点时不需要检索父节点的能力，那么Python标准库中的cElementTree应该可以很好地完成这一任务。如果您这样做了，LXML库将提供一个ElementTree实现，它将为您提供父引用。两者都使用编译的C模块来提高速度。）

网友

2楼 · 编辑于 2024-05-19 14:27:29

谢谢或者所有的评论。

我查看了ElementTree的iterparse，但那时我已经用xml.sax编写了一些代码。因为iterparse的直接优势很小，甚至根本不存在，所以我选择只使用xml.sax。与目前的解决方案相比，这已经是一个很大的优势。

好吧，这就是我最后做的。

class Exact(xml.sax.handler.ContentHandler):
    def __init__(self, stdpath):
        self.stdpath = stdpath

        self.thisrow = {}
        self.curpath = []
        self.getvalue = None

        self.conn = MySQLConnect()
        self.table = None
        self.numrows = 0

    def __del__(self):
        self.conn.close()

        print '%s rows affected' % self.numrows

    def startElement(self, name, att):
        self.curpath.append(name)

    def characters(self, data):
        if self.getValue is not None:
            self.thisrow[self.getValue.strip()] = data.strip()
            self.getValue = None

    def endElement(self, name):
        self.curpath.pop()

        if name == self.stdpath[len(self.stdpath) - 1]:
            self.EndRow()
            self.thisrow = { }

    def EndRow(self):
        self.numrows += MySQLInsert(self.conn, self.thisrow, True, self.table)
        #for k, v in self.thisrow.iteritems():
        #   print '%s: %s,' % (k, v),
        #print ''

    def curPath(self, full=False):
        if full:
            return ' > '.join(self.curpath)
        else:
            return ' > '.join(self.curpath).replace(' > '.join(self.stdpath) + ' > ', '')

然后，我对不同的XML文件进行了多次子类化：

class Debtors(sqlimport.Exact):
    def startDocument(self):
        self.table = 'debiteuren'
        self.address = None

    def startElement(self, name, att):
        sqlimport.Exact.startElement(self, name, att)

        if self.curPath(True) == ' > '.join(self.stdpath):
            self.thisrow = {}
            self.thisrow['debiteur'] = att.get('code').strip()
        elif self.curPath() == 'Name':
            self.getValue = 'naam'
        elif self.curPath() == 'Phone':
            self.getValue = 'telefoon1'
        elif self.curPath() == 'ExtPhone':
            self.getValue = 'telefoon2'
        elif self.curPath() == 'Contacts > Contact > Addresses > Address':
            if att.get('type') == 'V':
                self.address = 'Contacts > Contact > Addresses > Address '
        elif self.address is not None:
            if self.curPath() == self.address + '> AddressLine1':
                self.getValue = 'adres1'
            elif self.curPath() == self.address + '> AddressLine2':
                self.getValue = 'adres2'
        else:
            self.getValue = None

if __name__ == '__main__':
    handler = Debtors(['Debtors', 'Accounts', 'Account'])
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    parser.parse(open('myfile.xml', 'rb'))

。。。等等。。。

网友

3楼 · 编辑于 2024-05-19 14:27:29

我也用过sax，但后来我发现了一个更好的工具：iterparse from ElementTree。

它类似于sax，但是您可以检索包含内容的元素，为了释放内存，您只需在检索到元素后清除它。

相关问题更多 >

编程相关推荐

热门问题

热门文章