Error when parsing a large XML file with Python's etree.iterparse(): is there a logic error in the code?
I want to parse a very large XML file. A record in this file looks, for example, like this. Overall, the file is structured as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
record_1
...
record_n
</dblp>
I wrote some code that is supposed to pick certain records out of this file.
When I run the code (the whole run takes about 50 minutes, including storing the data in a MySQL database), I noticed a record that apparently has almost a million authors. That has to be wrong. I even checked the file to make sure it contains no errors: the paper in question has only 5 or 6 authors, so there is nothing wrong with dblp.xml. I therefore assume there is a logic error in my code, but I just can't find it. Perhaps someone can tell me where the mistake is?
The code stops at this line: if len(auth) > 2000.
import sys
import MySQLdb
from lxml import etree

elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]

def fast_iter(context, cursor):
    mydict = {}  # represents a paper with all its tags.
    auth = []    # a list of authors who have written the paper "together".
    counter = 0  # counts the papers
    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")
        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200:  # There are up to ~150 authors per paper.
            sys.exit("auth: It seems there is a paper which has too many authors!")
        if len(mydict) > 50:  # A paper can have much metadata.
            sys.exit("mydict: It seems there is a paper which has too many tags.")
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def main():
    cursor = connectToDatabase()
    cursor.execute("""SET NAMES utf8""")
    context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
    fast_iter(context, cursor)
    cursor.close()

if __name__ == '__main__':
    main()
EDIT:
I had it completely wrong when I wrote this function. I made a big mistake: while skipping some records I didn't need, I failed to notice that I was also mixing up some records I did need. At one point in the file I skipped almost a million records in a row, and the wanted records that followed got scrambled as a result.
With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to work well. I'll report back if any unexpected errors remain unsolved; otherwise, thank you all very much for the help, I really appreciate it!
def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
    ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])
    paper = {}    # represents a paper with all its tags.
    authors = []  # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    del context
2 Answers
Put print statements in the code blocks where you detect the start and end of the record elements, to make sure you are detecting those tags correctly. I suspect that, for some reason, the code that empties the author list is not being reached.
Try commenting this code out (or at least moving it into the "end" handling):
elem.clear()
while elem.getprevious() is not None:
    del elem.getparent()[0]
Python should be clearing those elements for you automatically as you iterate over the XML. The "del context" line is also unnecessary; just let reference counting take care of that for you.
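For example, a minimal debugging sketch along those lines: the print lines are hypothetical additions dropped into the two record branches of the question's loop, with everything else left as it was.

if elem.tag in elements and event == "start":
    print "START", elem.tag, elem.get("key")  # expect exactly one per record
    mydict["element"] = elem.tag
    # ... rest of the start handling unchanged ...
elif event == "end" and elem.tag in elements:
    print "END", elem.tag, elem.get("key"), "authors collected:", len(auth)
    counter += 1
    # ... rest of the end handling, including auth = [] ...

If the START and END lines do not alternate one-to-one per record, the boundaries are not being detected the way the cleanup code assumes.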
To clear up one point of confusion: you don't say explicitly whether the code you showed actually aborts on the "count > 2000" test. If it doesn't, the fault is probably in the database-update code that you haven't shown.
If it does abort:
(1) Lower the limits from 2000 to something sensible (about 20 for auth, and exactly 7 for mydict).
(2) When it does abort, dump the contents with print repr(mydict); print; print repr(auth) and compare them with what is actually in the file.
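As a rough sketch of points (1) and (2) combined (the limits 20 and 7 are only the ballpark figures suggested above, not hard rules):

if len(auth) > 20:  # far more authors than a typical paper should have
    print repr(mydict)
    print
    print repr(auth)
    sys.exit("auth: collected too many authors for one record")
if len(mydict) > 7:  # element, mdate, key, title, booktitle, year, journal
    print repr(mydict)
    print
    print repr(auth)
    sys.exit("mydict: collected too many tags for one record")

Comparing that dump with the corresponding record in dblp.xml should show which tags are leaking across record boundaries.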
Also, with iterparse(), elem.text is not guaranteed to have been parsed yet when the "start" event fires. To save some running time, you should access elem.text only when the "end" event occurs. In fact, there appears to be no need to do anything at all on "start" events. You also define a list tags that is never used. The start of your function could be written much more concisely:
def fast_iter(context, cursor):
    mydict = {}  # represents a paper with all its tags.
    auth = []    # a list of authors who have written the paper "together".
    counter = 0  # counts the papers
    tagset1 = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
    tagset2 = set(["title", "booktitle", "year", "journal"])
    for event, elem in context:
        tag = elem.tag
        if tag in tagset2:
            if elem.text:
                mydict[tag] = elem.text
        elif tag == "author":
            if elem.text:
                auth.append(elem.text)
        elif tag in tagset1:
            counter += 1
            print counter
            mydict["element"] = tag
            mydict["mdate"] = elem.get("mdate")
            mydict["dblpkey"] = elem.get("key")
            #populate_database(mydict, auth, cursor)
            mydict.clear()  # Why not just do mydict = {} ??
            auth = []
            # etc etc
Don't forget to fix up the iterparse() call by removing the events argument.
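That would leave a call along these lines (without an events argument, iterparse() delivers only "end" events by default):

context = etree.iterparse(PATH_TO_XML, dtd_validation=True)  # "end" events only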
I'm also fairly sure that elem.clear() should be done only when the event is "end", and only when tag in tagset1. Read the relevant documentation carefully; cleaning up during a "start" event may corrupt your tree.
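A sketch of where that cleanup could live, assuming the events argument has been removed so only "end" events arrive (the record handling itself is elided):

elif tag in tagset1:
    counter += 1
    mydict["element"] = tag
    # ... store the record, then reset mydict and auth ...
    # the whole record element has been parsed by now, so it is safe to release it
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]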