Error when parsing a large XML file with Python's etree.iterparse(): is there a logic error in the code?
I want to parse a very large XML file. A record in this file looks, for example, like this. Overall, the file is structured as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
record_1
...
record_n
</dblp>
I wrote some code that is supposed to pick certain records out of this file.
When I run the code (the whole run takes about 50 minutes, including storing the data in a MySQL database), I noticed a record that apparently has almost a million authors. That has to be wrong. I even checked the file to make sure it contains no errors: the paper in question has only 5 or 6 authors, so there is nothing wrong with dblp.xml. I therefore assume there is a logic error in my code, but I just can't find it. Perhaps someone can tell me where the mistake is?
The code stops at this line: if len(auth) > 2000.
import sys
import MySQLdb
from lxml import etree

elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]

def fast_iter(context, cursor):
    mydict = {}  # represents a paper with all its tags.
    auth = []    # a list of authors who have written the paper "together".
    counter = 0  # counts the papers
    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")
        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200:  # There are up to ~150 authors per paper.
            sys.exit("auth: It seems there is a paper which has too many authors!")
        if len(mydict) > 50:  # A paper can have much metadata.
            sys.exit("mydict: It seems there is a paper which has too many tags.")
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def main():
    cursor = connectToDatabase()
    cursor.execute("""SET NAMES utf8""")
    context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
    fast_iter(context, cursor)
    cursor.close()

if __name__ == '__main__':
    main()
EDIT:
I had it completely wrong when I wrote this function. I made a big mistake: while skipping some records I didn't need, I failed to notice that I was also mixing up some records I did need. At one point in the file I skipped almost a million records in a row, and the wanted records that followed got scrambled as a result.
With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to work well. I'll report back if any unexpected errors remain unsolved; otherwise, thank you all very much for the help, I really appreciate it!
def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
    ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])
    paper = {}    # represents a paper with all its tags.
    authors = []  # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    del context
2 Answers
Put print statements in the code blocks where you detect the start and end of the record elements, to make sure you are detecting those tags correctly. I suspect that, for some reason, the code that empties the author list is not being reached.
Try commenting this code out (or at least moving it into the "end" handling):
elem.clear()
while elem.getprevious() is not None:
    del elem.getparent()[0]
Python should be clearing those elements for you automatically as you iterate over the XML. The "del context" line is also unnecessary; just let reference counting take care of that for you.
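For example, a minimal debugging sketch along those lines: the print lines are hypothetical additions dropped into the two record branches of the question's loop, with everything else left as it was.

if elem.tag in elements and event == "start":
    print "START", elem.tag, elem.get("key")  # expect exactly one per record
    mydict["element"] = elem.tag
    # ... rest of the start handling unchanged ...
elif event == "end" and elem.tag in elements:
    print "END", elem.tag, elem.get("key"), "authors collected:", len(auth)
    counter += 1
    # ... rest of the end handling, including auth = [] ...

If the START and END lines do not alternate one-to-one per record, the boundaries are not being detected the way the cleanup code assumes.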
To clear up one point of confusion: you don't say explicitly whether the code you showed actually aborts on the "count > 2000" test. If it doesn't, the fault is probably in the database-update code that you haven't shown.
If it does abort:
(1) Lower the limits from 2000 to something sensible (about 20 for auth, and exactly 7 for mydict).
(2) When it does abort, dump the contents with print repr(mydict); print; print repr(auth) and compare them with what is actually in the file.
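As a rough sketch of points (1) and (2) combined (the limits 20 and 7 are only the ballpark figures suggested above, not hard rules):

if len(auth) > 20:  # far more authors than a typical paper should have
    print repr(mydict)
    print
    print repr(auth)
    sys.exit("auth: collected too many authors for one record")
if len(mydict) > 7:  # element, mdate, key, title, booktitle, year, journal
    print repr(mydict)
    print
    print repr(auth)
    sys.exit("mydict: collected too many tags for one record")

Comparing that dump with the corresponding record in dblp.xml should show which tags are leaking across record boundaries.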
Also, with iterparse(), elem.text is not guaranteed to have been parsed yet when the "start" event fires. To save some running time, you should access elem.text only when the "end" event occurs. In fact, there appears to be no need to do anything at all on "start" events. You also define a list tags that is never used. The start of your function could be written much more concisely:
def fast_iter(context, cursor):
    mydict = {}  # represents a paper with all its tags.
    auth = []    # a list of authors who have written the paper "together".
    counter = 0  # counts the papers
    tagset1 = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
    tagset2 = set(["title", "booktitle", "year", "journal"])
    for event, elem in context:
        tag = elem.tag
        if tag in tagset2:
            if elem.text:
                mydict[tag] = elem.text
        elif tag == "author":
            if elem.text:
                auth.append(elem.text)
        elif tag in tagset1:
            counter += 1
            print counter
            mydict["element"] = tag
            mydict["mdate"] = elem.get("mdate")
            mydict["dblpkey"] = elem.get("key")
            #populate_database(mydict, auth, cursor)
            mydict.clear()  # Why not just do mydict = {} ??
            auth = []
            # etc etc
Don't forget to fix up the iterparse() call by removing the events argument.
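That would leave a call along these lines (without an events argument, iterparse() delivers only "end" events by default):

context = etree.iterparse(PATH_TO_XML, dtd_validation=True)  # "end" events only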
I'm also fairly sure that elem.clear() should be done only when the event is "end", and only when tag in tagset1. Read the relevant documentation carefully; cleaning up during a "start" event may corrupt your tree.
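A sketch of where that cleanup could live, assuming the events argument has been removed so only "end" events arrive (the record handling itself is elided):

elif tag in tagset1:
    counter += 1
    mydict["element"] = tag
    # ... store the record, then reset mydict and auth ...
    # the whole record element has been parsed by now, so it is safe to release it
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]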