如何在lxml中用文本替换元素?

13 投票
3 回答
12012 浏览
提问于 2025-04-16 14:19

用lxml的ElementTree API从XML文档中完全删除一个元素是很简单的,但我找不到一个简单的方法来把一个元素替换成文本。例如,给定以下输入:

input = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

... 你可以很容易地用下面的代码删除每一个<r>元素:

from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
print etree.tostring(f, pretty_print=True)

但是,如果你想把每个元素替换成文本,得到这样的输出:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/>Text after a sibling DELETED Text before a sibling<b/></m>
</everything>

我觉得,因为ElementTree API通过每个元素的.text.tail属性来处理文本,而不是树中的节点,这就意味着你需要处理很多不同的情况,比如元素是否有兄弟元素,现有元素是否有.tail属性等等。我是不是错过了什么简单的方法来做到这一点?

3 个回答

4

使用 ET.XSLT

import io
import lxml.etree as ET

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f=ET.fromstring(data)
xslt='''\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">    

    <!-- Replace r nodes with DELETED
         http://www.w3schools.com/xsl/el_template.asp -->
    <xsl:template match="r">DELETED</xsl:template>

    <!-- How to copy XML without changes
         http://mrhaki.blogspot.com/2008/07/copy-xml-as-is-with-xslt.html -->    
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="@*|text()|comment()|processing-instruction">
        <xsl:copy-of select="."/>
    </xsl:template>
    </xsl:stylesheet>
'''

xslt_doc=ET.parse(io.BytesIO(xslt))
transform=ET.XSLT(xslt_doc)
f=transform(f)

print(ET.tostring(f))

会得到

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
8

使用 strip_elements 有一个缺点,就是你不能选择性地保留某些 <r> 元素,同时替换掉其他的。此外,它还需要一个 ElementTree 实例(这可能并不总是存在)。最后,你也不能用它来替换 XML 注释或处理指令。

下面的代码应该能帮你解决问题:

for r in f.xpath('//r'):
    text = 'DELETED' + r.tail 
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            previous.tail = (previous.tail or '') + text
        else:
            parent.text = (parent.text or '') + text
        parent.remove(r)
20

我觉得unutbu的XSLT解决方案可能是实现你目标的正确方法。

不过,这里有一种稍微不太正规的办法,可以通过修改<r/>标签的尾部,然后使用etree.strip_elements来实现。

from lxml import etree

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f = etree.fromstring(data)
for r in f.xpath('//r'):
  r.tail = 'DELETED' + r.tail if r.tail else 'DELETED'

etree.strip_elements(f,'r',with_tail=False)

print etree.tostring(f,pretty_print=True)

这样可以得到:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>

撰写回答