How do I use lxml in Python to strip all child tags from an XML tag, but keep their text merged into the parent tag?

1 vote
2 answers
2547 views
Asked 2025-04-16 21:02

How can I make etree.strip_tags() strip every possible tag under a given element?

Do I have to list them all out myself, like this:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Is there a more elegant approach that I'm not aware of?

Example input:

parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

Or even better:

This is some text with multiple tags and sometimes they are nested.

2 Answers

3

This answer comes a bit late, but I think a simpler solution than the one given by the original answerer, ars, might help you.

Short answer

Pass "*" as the tag argument when calling strip_tags() to strip all tags.

Detailed answer

From your XML string, we can create an lxml element:

>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)

You can inspect the instance like this:

>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

To strip all tags except the parent tag itself, use the etree.strip_tags() function as you suggested, but pass "*" as the tag argument:

>>> lxml.etree.strip_tags(parent_tag, "*")

Checking again shows that all child tags are gone:

>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'

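This is the output you wanted. Note that this modifies the lxml element instance in place! To make it even better (as you asked :-)), just read the text attribute: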

>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
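
If the in-place modification is a problem, here is a minimal sketch of two ways around it, using the same example string. The variable names plain_text and stripped are only illustrative. The first option reads the text with itertext() and leaves the tree alone; the second runs strip_tags() on a deep copy.

```python
import copy
import lxml.etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
     "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")
parent_tag = lxml.etree.fromstring(s)

# Option 1: join every text node in document order; the tree is not modified.
plain_text = "".join(parent_tag.itertext())

# Option 2: strip the tags on a deep copy so the original element stays intact.
stripped = copy.deepcopy(parent_tag)
lxml.etree.strip_tags(stripped, "*")

print(plain_text)
print(len(parent_tag))  # the original still has its direct children
print(lxml.etree.tostring(stripped, encoding="unicode"))
```

Option 1 is probably enough when you only need the string; option 2 fits better when you still want an element with the tags removed.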

5

You can use the lxml.html.clean module to process HTML content:

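# Note: newer lxml releases (5.2+) ship lxml.html.clean separately
# as the lxml_html_clean package.
import lxml.etree, lxml.html, lxml.html.clean

s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

# Parse with the HTML parser, then keep only the <parent> tag.
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

# encoding="unicode" returns a str instead of bytes on Python 3.
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>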
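
If only the bare text is needed (the "even better" output from the question), elements parsed with lxml.html also offer text_content(). The following is a small sketch under the assumption that the input is parsed with the HTML parser as above; text_only is just an illustrative name.

```python
import lxml.html

s = ('<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> '
     'and sometimes they <tt>are<bold> nested</bold></tt>.</parent>')

# Parse with the HTML parser, as in the Cleaner example.
tree = lxml.html.fromstring(s)

# text_content() returns the text of the element and all its descendants,
# without modifying the tree.
text_only = tree.text_content()
print(text_only)
```

This skips the Cleaner entirely, so it only makes sense when you do not need the cleaned element itself.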
