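How can I replace an element with text in lxml?
It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input: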
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
... you can easily remove every <r> element with the following code:
from lxml import etree

f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Note: remove() also discards the element's tail text
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())
But what if you want to replace each element with text instead, to get output like this:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
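It seems to me that because the ElementTree API handles text through the .text and .tail attributes of each element rather than as nodes in the tree, you have to deal with a lot of different cases: whether the element has sibling elements, whether the existing element has a .tail attribute, and so on. Have I missed some easy way of doing this?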
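To make the .text/.tail model concrete, here is a small sketch that inspects where lxml stores each piece of text for the last <m> of the sample input. The variable names are illustrative only and do not come from the question.

```python
from lxml import etree

# The last <m> element of the sample input, parsed on its own.
m = etree.fromstring(
    '<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>'
)
first_b, r, last_b = list(m)

# Text before the first child would live in m.text; here there is none.
print(repr(m.text))
# Text that follows an element is stored in that element's .tail.
print(repr(first_b.tail))
print(repr(r.tail))
print(repr(last_b.tail))
```

So replacing <r/> here means appending to the .tail of the preceding <b/>, whereas in the other samples, where <r/> is the first child, the text has to go into the parent's .text.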
3 Answers
4
Use ET.XSLT:
import io
import lxml.etree as ET
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
f=ET.fromstring(data)
xslt='''\
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Replace r nodes with DELETED
http://www.w3schools.com/xsl/el_template.asp -->
<xsl:template match="r">DELETED</xsl:template>
<!-- How to copy XML without changes
http://mrhaki.blogspot.com/2008/07/copy-xml-as-is-with-xslt.html -->
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@*|text()|comment()|processing-instruction()">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
'''
xslt_doc=ET.parse(io.BytesIO(xslt.encode('utf-8')))
transform=ET.XSLT(xslt_doc)
f=transform(f)
print(ET.tostring(f).decode())
which yields:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
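If the replacement text should not be hard-coded in the stylesheet, one possible variation is to pass it in as a stylesheet parameter. The sketch below assumes a parameter named replacement (a name chosen here for illustration, not part of the answer) and uses a condensed identity template instead of the two copy templates above.

```python
from lxml import etree as ET

# Stylesheet with a hypothetical "replacement" parameter.
xslt = ET.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="replacement"/>
  <!-- Emit the parameter value in place of every r element -->
  <xsl:template match="r"><xsl:value-of select="$replacement"/></xsl:template>
  <!-- Identity template: copy everything else unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
''')
transform = ET.XSLT(xslt)

doc = ET.fromstring('<m>Text before <r/> and after</m>')
# strparam() quotes the Python string so it is passed as an XPath string.
result = transform(doc, replacement=ET.XSLT.strparam('DELETED'))
output = ET.tostring(result).decode()
print(output)
```

Keeping the text out of the stylesheet means the same compiled transform can be reused with different replacement strings.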
8
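Using strip_elements has the disadvantage that you cannot selectively keep some of the <r> elements while replacing others. It also requires an ElementTree instance (which may not always exist). Finally, you cannot use it to replace XML comments or processing instructions.
The following code should do the job: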
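for r in f.xpath('//r'):
    # r.tail is None when no text follows the element
    text = 'DELETED' + (r.tail or '')
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            # Append to the tail of the preceding sibling
            previous.tail = (previous.tail or '') + text
        else:
            # No preceding sibling: append to the parent's text
            parent.text = (parent.text or '') + text
        parent.remove(r)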
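The same logic can be wrapped in a small helper so that it is easy to apply selectively. This is only a sketch: the function name replace_with_text and the keep attribute used to mark elements that should survive are hypothetical and chosen purely for illustration.

```python
from lxml import etree


def replace_with_text(element, text):
    """Replace `element` with `text`, preserving the element's tail text."""
    parent = element.getparent()
    if parent is None:
        raise ValueError('cannot replace the root element with text')
    new_text = text + (element.tail or '')
    previous = element.getprevious()
    if previous is not None:
        # Text after a sibling belongs to that sibling's tail.
        previous.tail = (previous.tail or '') + new_text
    else:
        # With no preceding sibling, the text belongs to the parent.
        parent.text = (parent.text or '') + new_text
    # remove() drops the element together with its (already copied) tail.
    parent.remove(element)


root = etree.fromstring(
    '<everything>'
    '<m>Text before <r/> and after</m>'
    '<m>Keep <r keep="yes"/> this one</m>'
    '</everything>'
)
# Only replace <r> elements that lack the hypothetical "keep" attribute.
for r in root.xpath('//r[not(@keep)]'):
    replace_with_text(r, 'DELETED')

result = etree.tostring(root).decode()
print(result)
```

Because the selection is done by the XPath expression, any subset of elements can be replaced while the rest are left alone.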
20
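I think unutbu's XSLT solution is probably the correct way to achieve your goal.
However, here is a somewhat hacky way to do it, by modifying the tails of the <r/> tags and then using etree.strip_elements.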
from lxml import etree
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Prepend the replacement text to the tail, which strip_elements will keep
    r.tail = 'DELETED' + r.tail if r.tail else 'DELETED'
etree.strip_elements(f, 'r', with_tail=False)
print(etree.tostring(f, pretty_print=True).decode())
This gives:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
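The trick above depends on with_tail=False. As a minimal sketch of the difference (using a made-up one-line document rather than the sample input), compare the default behaviour with the one used in this answer:

```python
from lxml import etree

xml = '<m>before <r/> after</m>'

# Default (with_tail=True): the tail text is removed together with <r/>.
dropped = etree.fromstring(xml)
etree.strip_elements(dropped, 'r')
dropped_xml = etree.tostring(dropped).decode()
print(dropped_xml)

# with_tail=False: the tail text stays and is merged into the preceding text.
kept = etree.fromstring(xml)
etree.strip_elements(kept, 'r', with_tail=False)
kept_xml = etree.tostring(kept).decode()
print(kept_xml)
```

With the default setting the 'DELETED' text stored in the tail would be thrown away along with the element, which is why the answer passes with_tail=False.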