Fixing tostring() in Python's lxml

Posted 2024-04-26 17:34:00


lxml's tostring() function seems quite broken when printing only part of a document. Witness:

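from lxml.html import fragment_fromstring, tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
# cssselect() needs the separate cssselect package to be installed.
em = frag.cssselect('em').pop(0)
# Python 3: print() is a function; encoding='unicode' returns str, not bytes.
print(tostring(em, encoding='unicode'))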

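I expected <em>really</em>, but instead it prints <em>really</em> great!, which is wrong. The " great!" is not part of the selected em element. This is not only wrong, it is a pain, at least when processing document-structured XML, where such trailing text is common.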

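As I understand it, lxml stores any free text that follows an element in that element's .tail attribute. A scan of the tostring() code led me to the _write() function in ElementTree.py, which evidently always prints the tail. That is correct behavior for a whole tree, where every tail belongs to the output. When rendering a subtree, though, the tail of the outermost element should not be written, and the code makes no such distinction.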
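The tail mechanism can be seen without lxml at all. The sketch below uses the standard library's xml.etree.ElementTree, whose element model lxml follows; swapping it in for lxml is an assumption made here so the example runs anywhere, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the fragment from the question (stdlib ET needs
# the closing </p>, unlike lxml's lenient HTML parser).
frag = ET.fromstring('<p>This stuff is <em>really</em> great!</p>')
em = frag.find('em')

# The text inside <em> lives in .text; the text after </em> lives in .tail.
print(repr(em.text))
print(repr(em.tail))

# Serializing the subtree writes the tail as well, which is the behavior
# the question complains about.
serialized = ET.tostring(em, encoding='unicode')
print(serialized)
```

The tail belongs to the em element rather than to its parent, so any serializer that walks the element and writes everything it owns will emit the trailing text too.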

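To get a correct, tail-free rendering of the selected XML, I first tried writing a toxml() function from scratch as a replacement. It mostly worked, but there are many special cases in handling comments, processing instructions, namespaces, encodings, yadda yadda. So I changed gears and now just piggyback on tostring(), post-processing its output to strip the offending .tail text: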

def toxml(e):
    """ Replacement for lxml's tostring() method that doesn't add spurious
    tail text. """

    from lxml.etree import tostring
    xml = tostring(e)
    if e.tail:
        xml = xml[:-len(e.tail)]
    return xml

A basic series of tests shows that this approach works nicely.
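One case that basic tests may miss is a tail containing characters that get escaped on output: the serialized tail is then longer than len(e.tail), so slicing cuts in the wrong place. The sketch below reproduces this with the standard library's xml.etree.ElementTree (an assumption, used so the example is self-contained) and shows one possible alternative that detaches the tail temporarily instead of trimming the output. The names toxml_slice and toxml_no_tail are hypothetical. With lxml itself, tostring() also accepts a with_tail keyword, so tostring(e, with_tail=False) is probably the simplest route; check it against your installed version.

```python
import xml.etree.ElementTree as ET

def toxml_slice(e):
    """The slicing approach from the question, ported to the stdlib."""
    xml = ET.tostring(e, encoding='unicode')
    if e.tail:
        # Assumes the serialized tail is exactly len(e.tail) characters long.
        xml = xml[:-len(e.tail)]
    return xml

def toxml_no_tail(e):
    """Serialize e without its tail by detaching the tail temporarily."""
    tail, e.tail = e.tail, None
    try:
        return ET.tostring(e, encoding='unicode')
    finally:
        e.tail = tail  # restore the tree exactly as it was

# The tail here contains '&', which is written out as '&amp;'.
frag = ET.fromstring('<p>Fish <em>and</em> chips &amp; peas</p>')
em = frag.find('em')

sliced = toxml_slice(em)      # cut too short: part of the tail survives
detached = toxml_no_tail(em)  # unaffected by escaping
print(sliced)
print(detached)
```

Detaching the tail sidesteps escaping and encoding questions altogether, at the cost of briefly mutating the tree, which matters if other threads read it at the same time.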

Critiques and/or suggestions?


