Parsing a non-standard XML file with Python
My input file is actually multiple XML files concatenated into a single file. (The file comes from Google Patents.) It is structured like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
Python's xml.dom.minidom cannot parse this non-standard file. Is there a better way to parse it? I'm also not sure whether the code below performs well.
from xml.dom import minidom

XMLstring = ""
for line in infile:
    if line.strip() == '<?xml version="1.0" encoding="UTF-8"?>':
        xmldoc = minidom.parseString(XMLstring)  # parse the accumulated document
    else:
        XMLstring += line
3 Answers
0
I don't know much about minidom or XML parsing in general, but I have used XPath to parse XML and HTML, for example via the lxml module.
Here are some XPath examples for reference: http://www.w3schools.com/xpath/xpath_examples.asp
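For instance, here is a minimal sketch of an XPath query with lxml; the xml_chunk string is a made-up stand-in for one of the documents in your file, and doc-number is borrowed from the patent data shown elsewhere on this page:

from lxml import etree

# Made-up stand-in for a single document extracted from the input file.
xml_chunk = "<root_node><doc-number>D0629996</doc-number></root_node>"

doc = etree.fromstring(xml_chunk)

# XPath: select the text of every <doc-number> element in the document.
for number in doc.xpath('//doc-number/text()'):
    print number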
2
I suggest parsing each chunk of XML separately.
It looks like you are already doing that in your sample code. Here are my thoughts on your code:
from xml.dom import minidom

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk
Once you have broken the file down into individual XML chunks, how you actually parse them depends on your needs and your personal preference. There are several options, such as lxml, minidom, elementtree, expat, BeautifulSoup, and so on. A small sketch of one option follows.
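For example, a minimal sketch of the elementtree option; the xml_string below is a made-up stand-in for one chunk produced by the splitting step:

import xml.etree.ElementTree as ET

# Made-up stand-in for one chunk produced by the splitting step.
xml_string = "<root_node><doc-number>D0629996</doc-number></root_node>"

root = ET.fromstring(xml_string)
for elem in root.findall(".//doc-number"):  # find all <doc-number> descendants
    print elem.text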
Update:
Starting from scratch, here is how I would do it (using BeautifulSoup):
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

for xml_string in separated_xml("ipgb20110104.xml"):
    soup = BeautifulSoup(xml_string)
    for num in soup.findAll("doc-number"):
        print num.contents[0]
This returns:
D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
6
Here's my take on the problem, using a generator and lxml.etree. The information extracted is just by way of example.
import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data, separator=lambda x: x.startswith('<?xml')):
    buff = []
    for line in data:
        if separator(line):
            if buff:
                yield ''.join(buff)
                buff[:] = []
        buff.append(line)
    yield ''.join(buff)

def first(seq, default=None):
    """Return the first item from sequence, seq or the default(None) value"""
    for item in seq:
        return item
    return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
    with open(filename, 'wb') as file_write:
        r = urllib2.urlopen(datasrc)
        file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 10:
        break
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
    print "DocID: {0}\nTitle: {1}\nAssignee: {2}\n".format(docID, title, assignee)
Output:
DocID: US-D0629996-S1-20110104
Title: Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID: US-D0629997-S1-20110104
Title: Belt sleeve
Assignee: None

DocID: US-D0629998-S1-20110104
Title: Underwear
Assignee: X-Technology Swiss GmbH

DocID: US-D0629999-S1-20110104
Title: Portion of compression shorts
Assignee: Nike, Inc.

DocID: US-D0630000-S1-20110104
Title: Apparel
Assignee: None

DocID: US-D0630001-S1-20110104
Title: Hooded shirt
Assignee: None

DocID: US-D0630002-S1-20110104
Title: Hooded shirt
Assignee: None

DocID: US-D0630003-S1-20110104
Title: Hooded shirt
Assignee: None

DocID: US-D0630004-S1-20110104
Title: Headwear cap
Assignee: None

DocID: US-D0630005-S1-20110104
Title: Footwear
Assignee: Vibram S.p.A.