Remove an element but keep the text that follows it

2 votes

1 answer

845 views

Asked 2025-04-18 02:03

I have an XML file like this:

<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>

I want to remove every <b> and <u> element (along with everything inside it) and then print the text that remains. Here is what I tried:

from __future__ import print_function
import xml.etree.ElementTree as ET

tree = ET.parse('a.xml')
root = tree.getroot()

parent_map = {c:p for p in root.iter() for c in p}

for item in root.findall('.//b'):
  parent_map[item].remove(item)
for item in root.findall('.//u'):
  parent_map[item].remove(item)
print(''.join(root.itertext()).strip())

(I built parent_map using the approach from this answer.) The problem is that remove(item) also discards the text that follows the removed element, so the result is:

Some that I

What I want is:

Some  text that I  want to keep.

Is there a way to do this?


1 Answer

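If you don't end up using something better, you can call clear() instead of remove(). In ElementTree the text after an element's closing tag is stored in that element's tail attribute, and clear() resets it too, so save the tail first and restore it afterwards: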

import xml.etree.ElementTree as ET


data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

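tree = ET.fromstring(data)  # fromstring() returns the root element
a = tree.find('a')
for element in a:  # direct children of <a> only
    if element.tag in ('b', 'u'):
        tail = element.tail   # clear() wipes the tail as well, so save it
        element.clear()       # drops text, attributes and child elements
        element.tail = tail   # put the trailing text back

# tostring() returns bytes on Python 3; decode so print() shows plain text
print(ET.tostring(tree).decode())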

This prints (note the empty b and u tags that remain):

<root>
<a>Some <b /> text <i>that</i> I <u /> want to keep.</a>
</root>
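
If the leftover empty tags are unwanted, the elements can be removed outright as long as their tail text is moved somewhere else first. The following is only a minimal sketch of that idea: the helper name remove_keep_tail is hypothetical (it is not part of ElementTree), and it assumes the tail should be appended to the previous sibling's tail, or to the parent's text when there is no previous sibling.

```python
import xml.etree.ElementTree as ET

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""


def remove_keep_tail(parent, child):
    """Hypothetical helper: remove child from parent but keep child's tail text."""
    tail = child.tail or ''
    siblings = list(parent)
    index = siblings.index(child)
    if index == 0:
        # No previous sibling: the tail joins the parent's leading text.
        parent.text = (parent.text or '') + tail
    else:
        # Otherwise the tail joins the previous sibling's tail.
        previous = siblings[index - 1]
        previous.tail = (previous.tail or '') + tail
    parent.remove(child)


root = ET.fromstring(data)
# Collect (parent, child) pairs first so the tree is not mutated while iterating.
doomed = [(p, c) for p in root.iter() for c in p if c.tag in ('b', 'u')]
for parent, child in doomed:
    remove_keep_tail(parent, child)

result = ''.join(root.itertext()).strip()
print(result)
```

For the sample document this should print the text the question asks for, with the double spaces where the removed elements used to be.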

Alternatively, here is a solution using xml.dom.minidom:

import xml.dom.minidom

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

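dom = xml.dom.minidom.parseString(data)
a = dom.getElementsByTagName('a')[0]
# Iterate over a copy: removing from the live childNodes list while
# looping over it would skip the node right after each removed one.
for child in list(a.childNodes):
    # Text nodes have no tagName, hence the getattr() default
    if getattr(child, 'tagName', '') in ('u', 'b'):
        a.removeChild(child)

print(dom.toxml())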

This prints:

<?xml version="1.0" ?><root>
<a>Some  text <i>that</i> I  want to keep.</a>
</root>
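
The question asks for the remaining text rather than the XML, so a possible follow-up is to concatenate the text nodes that are left. This is a rough sketch, not part of the original answer: collect_text is a hypothetical helper, and the loop removes b and u elements at any depth instead of only the direct children of a.

```python
import xml.dom.minidom

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""


def collect_text(node):
    """Hypothetical helper: join the data of every text node under node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(collect_text(child) for child in node.childNodes)


dom = xml.dom.minidom.parseString(data)
for tag in ('b', 'u'):
    # list() takes a snapshot so removals cannot disturb the iteration.
    for element in list(dom.getElementsByTagName(tag)):
        element.parentNode.removeChild(element)

remaining = collect_text(dom.documentElement).strip()
print(remaining)
```

In minidom the text after a removed element is a separate sibling text node, so it survives removeChild() without any extra handling.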

Answered 2025-04-18 by Python大师
