Parsing broken XML with lxml.etree.iterparse

20 votes
3 answers
27376 views
Asked 2025-04-15 19:50

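I'm trying to parse a huge XML file with lxml in a memory-efficient way, i.e. streaming it lazily from disk instead of loading the whole file into memory. Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken XML like this?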

#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

#how do I do the equivalent with iterparse?  (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from

Thanks for your help!

EDIT -- Here is an example of the kind of encoding error I'm running into:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.from
lxml.etree.fromstring      lxml.etree.fromstringlist  

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

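As you can see, chardet thinks this is an ASCII file, but there is a "\x1e" right in the middle of this example, and that is what makes lxml raise an exception.

3 Answers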

3

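Edit your question, stating what happens (exact error message and traceback; copy/paste, don't type from memory) to make you think that "bad unicode" is the problem.

Get chardet and feed it your MySQL dump. Tell us what it says.

Show us the first 200 to 300 bytes of your dump, e.g. with print repr(dump[:300])

Update You wrote: "As you can see, chardet thinks this is an ASCII file, but there is a "\x1e" right in the middle of this example, and that is what makes lxml raise an exception."

I see no "bad unicode" here.

chardet is correct. What makes you think "\x1e" isn't ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".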
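A quick way to convince yourself of this, as a minimal Python 3 sketch using only the standard library (the variable names are illustrative): the character encodes to ASCII without error, its code point is 30 (the "Char value 30" in the traceback), and its Unicode category is "Cc", i.e. a control character.

```python
import unicodedata

ch = "\x1e"

# Encoding to ASCII would raise UnicodeEncodeError for a non-ASCII character.
encoded = ch.encode("ascii")

# "Cc" is the Unicode general category for control characters.
category = unicodedata.category(ch)

print(ord(ch), encoded, category)
```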

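The error message says that you have an invalid character, and that is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be complaining about this and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)

Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or remove such characters from your dump (after checking that they can all in fact be dropped), or choose an output format that is less picky about characters and less voluminous, such as CSV.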
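If you go the "remove them from the dump" route, something along these lines may help. It is only a sketch: the helper name strip_invalid_xml_chars is made up for illustration, it works on already-decoded text (Python 3 str), and the character class is taken from the Char production of the XML 1.0 specification.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop characters that XML 1.0 does not allow."""
    return _INVALID_XML_CHARS.sub("", text)


sample = "<article>t2\x1et3\tok\r\n</article>"
cleaned = strip_invalid_xml_chars(sample)
print(repr(cleaned))
```

Tab, carriage return and line feed survive; the record separator is dropped. Whether silently dropping characters is acceptable depends on your data, so check a sample first.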

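Update 2 You presumably want to use iterparse() not because it is your ultimate goal but because you want to save memory. If you used a format like CSV, you wouldn't have a memory problem.

Update 3 In response to a comment by @Purrell:

Try it yourself, dude. pastie.org/3280965

Here is the content of that pastie; it deserves preserving: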

from lxml.etree import etree

data = '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(data), magical_parser)

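To get it to run, one import needs to be fixed and another supplied. The data is enormous, and there is no output to show the result. Here is a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding &lt; and &gt;), all of which are valid XML characters, are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.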

[output wraps at column 80]
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> data = '<article>&lt;p&gt;t1&lt;/p&gt;&lt;p&gt;t2\x1et3&lt;/p&gt;&lt;p&gt;t4
&lt;/p&gt;&lt;p&gt;t5&lt;/p&gt;</article>'
>>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
>>> tree = etree.parse(StringIO(data), magical_parser)
>>> print(repr(tree.getroot().text))
'<p>t1</p><p>t2t3/ppt4/ppt5/p'

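I wouldn't call that "recovery"; after the bad character, the < and > characters have vanished.

The pastie was a response to my asking "What makes you think that encoding='utf-8' will solve his problem?", which in turn was prompted by the statement 'There is however an "encoding" option which would have fixed your problem.' But encoding='ascii' produces the same output, and so does omitting the encoding argument. This is NOT an encoding problem. Case closed.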

50

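Edit:

This is an older answer, and I would handle it differently today. And I don't just mean the dumb snark... Since then BeautifulSoup4 has become available, and it's really quite nice. I recommend it to anyone who stumbles across this.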


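The currently accepted answer is, well, not what one should do. The question itself also makes a bad assumption:

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually, recover=True is for recovering from malformed XML. There is however an "encoding" option which would have fixed your problem.

parser = lxml.etree.XMLParser(encoding='utf-8', #Your encoding issue.
                              recover=True, #I assume you probably still want to recover from bad xml, it's quite nice. If not, remove.
                              )

That's it, that's the solution.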


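By the way -- for anyone struggling with parsing XML in Python, especially XML from third-party sources: I know, I know, the documentation is bad and there are a lot of red herrings on StackOverflow; a lot of the advice is bad.

  • lxml.etree.fromstring()? - That's for perfectly formed XML, silly
  • BeautifulStoneSoup? - Slow, and has a really stupid policy for self-closing tags
  • lxml.etree.HTMLParser()? - (because the XML is broken) Here's a secret - HTMLParser() is really just a parser with recover=True
  • lxml.html.soupparser? - The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup with self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit
  • UnicodeDammit and other weird stuff to fix encodings? - Well, UnicodeDammit is kind of cute, I like the name, and it's useful for things beyond XML, but things are usually fixed if you use XMLParser() the right way.

You could be trying all sorts of stuff from what's available online. The lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here it is again: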

magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser) #or pass in an open file object

You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.

7

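I solved the problem by creating a class with a file-like interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.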

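# sketch of the file-like wrapper
import lxml.etree

class myFile(object):
    def __init__(self, filename):
        # Binary mode, so the bytes reach the parser without a decoding step
        self.f = open(filename, 'rb')

    def read(self, size=None):
        # Return one cleaned line per call. readline() returns b'' at EOF,
        # which tells the parser that the input is finished.
        line = self.f.readline()
        return line.replace(b'\x1e', b'')  # chain more .replace() calls for other bad characters


# iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')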

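I had to edit the myFile class a couple of times, adding more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked as well (it seems to support the recover option), but this solution worked like a charm!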
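For anyone who would rather not keep adding replace() calls by hand, here is a variation on the same wrapper idea, as a hedged Python 3 sketch. It deletes every C0 control byte that XML 1.0 forbids in a single bytes.translate() call. Assumptions: the class name CleanedFile is made up, the input is in an ASCII-compatible encoding such as ASCII or UTF-8 (this would corrupt UTF-16), and the standard library's xml.etree.ElementTree.iterparse stands in for lxml so that the snippet runs without lxml installed. The wrapper exposes the same read() method that the myFile class above hands to lxml.etree.iterparse.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch

# C0 control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
BAD_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


class CleanedFile(object):
    """Hypothetical file-like wrapper that drops forbidden control bytes."""

    def __init__(self, fileobj):
        self._f = fileobj  # must be opened in binary mode

    def read(self, size=-1):
        while True:
            chunk = self._f.read(size)
            if not chunk:
                return b""  # real end of file
            cleaned = chunk.translate(None, BAD_BYTES)
            if cleaned:
                return cleaned
            # The chunk held only bad bytes; keep reading instead of
            # returning b"", which the parser would take for EOF.


raw = b"<root><RECORD>t2\x1et3</RECORD><RECORD>t4</RECORD></root>"
texts = []
for _event, elem in ET.iterparse(CleanedFile(io.BytesIO(raw))):
    if elem.tag == "RECORD":
        texts.append(elem.text)
        elem.clear()  # release the element once its text has been extracted
print(texts)
```

With a real file you would pass open('bigfile.xml', 'rb') instead of the BytesIO object. Reading in chunks rather than lines also avoids surprises when the file contains very long lines.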
