Removing inline tags with Python's lxml

Asked 2025-04-16 20:15

I need to handle two kinds of inline tags in an XML document. The first kind wraps text that I want to keep, and I can handle that with lxml:

etree.tostring(element, method="text", encoding='utf-8')
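
For illustration, here is a minimal sketch of what that call does. The sample document and the tag name `z` are made up for this example and are not taken from my real data:

```python
from lxml import etree

# Hypothetical sample document; <z> stands in for the "keep" tag.
element = etree.fromstring("<x>hello, <z>keep me</z> and more</x>")

# method="text" drops all markup and concatenates the text of every descendant.
flat = etree.tostring(element, method="text", encoding="utf-8")

# tostring() returns bytes when given a byte encoding, so decode for display.
print(flat.decode("utf-8"))
```

The inline tag disappears, but everything it wrapped stays in the result, which is exactly what I want for the first kind of tag.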

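The second kind contains text that I do not want to keep. How can I remove these tags together with their text? If possible, I would prefer not to use regular expressions.

Thanks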

1 Answer

13

I think strip_tags and strip_elements are the tools you need, one for each case. For example, this script:

from lxml import etree

text = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"

tree = etree.fromstring(text)

# tostring() returns bytes, so decode before printing:
print(etree.tostring(tree, pretty_print=True).decode())

# Remove the <z> tags, but keep their contents:
etree.strip_tags(tree, 'z')

print('-' * 72)
print(etree.tostring(tree, pretty_print=True).decode())

# Remove all the <y> elements including their contents;
# with_tail=False keeps the text that follows each one:
etree.strip_elements(tree, 'y', with_tail=False)

print('-' * 72)
print(etree.tostring(tree, pretty_print=True).decode())

... produces the following output:

<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and <y>ignore me</y>, and here's some <y>more</y> text</x>

------------------------------------------------------------------------
<x>hello, keep me and , and here's some  text</x>
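
If the end goal is plain text, the two calls can be combined with the `method="text"` serialization from the question. The following is only a sketch: the helper name `visible_text` and its `drop` and `keep` parameters are hypothetical, introduced here for illustration, and the tag names are the ones from the sample above.

```python
from lxml import etree


def visible_text(xml, drop=("y",), keep=("z",)):
    """Return the text of `xml` with `drop` elements removed and `keep` tags unwrapped.

    Hypothetical helper; the name and parameters are assumptions for this sketch.
    """
    tree = etree.fromstring(xml)
    # with_tail=False preserves the text that follows each removed element.
    etree.strip_elements(tree, *drop, with_tail=False)
    # Unwrapping is not strictly needed before method="text",
    # but it leaves the tree in the same state as the script above.
    etree.strip_tags(tree, *keep)
    # encoding="unicode" makes tostring() return str instead of bytes.
    return etree.tostring(tree, method="text", encoding="unicode")


sample = "<x>hello, <z>keep me</z> and <y>ignore me</y>, and here's some <y>more</y> text</x>"
print(visible_text(sample))
```

Note that removing an element leaves the surrounding whitespace and punctuation untouched, so the result can contain fragments such as "and , and" or a double space. Cleaning those up, if needed, is a separate step.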
