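How can I use lxml in Python to strip all child tags from an XML element while keeping their text merged into the parent?
How do I get etree.strip_tags() to strip every possible tag under a given element?
Do I have to list them all myself, something like: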
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Is there a more elegant approach that I'm not aware of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
Or even better:
This is some text with multiple tags and sometimes they are nested.
2 Answers
3
This answer is a bit late, but I think a simpler solution than the one from the original answerer, ars, may help you.
Short answer
Pass "*" as the tag argument when calling strip_tags() to strip all tags.
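To make the difference between listing tag names and using the wildcard concrete, here is a minimal sketch. The variable names (named, wildcard, and the two remaining_* lists) are only illustrative and not part of the question; the input string is the one from the question.

```python
from lxml import etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag>"
     " and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")

# Naming tags explicitly strips only those tags; other elements stay in the tree.
named = etree.fromstring(s)
etree.strip_tags(named, "i", "tt")
remaining_after_named = [el.tag for el in named.iter()]

# The "*" wildcard strips every descendant element, whatever its name.
# The element passed to strip_tags() itself is never removed.
wildcard = etree.fromstring(s)
etree.strip_tags(wildcard, "*")
remaining_after_wildcard = [el.tag for el in wildcard.iter()]

print(remaining_after_named)     # ['parent', 'some_tag', 'bold']
print(remaining_after_wildcard)  # ['parent']
```

So with "*" there is no need to maintain a list of tag names up front.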
Detailed answer
From your XML string, we can create an lxml element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect the instance like this:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip every tag except the parent tag itself, use the etree.strip_tags() function as you suggested, but pass "*" as the argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspecting it again shows that all child tags have been stripped:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
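This is your desired output. Note that this modifies the lxml element instance in place! To make it even better (as you asked :-)), just read the text attribute: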
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
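Since strip_tags() changes the element in place, it may be worth knowing a few ways to get the same text while leaving the original tree untouched. The following is only a sketch of alternatives, not something the question asked for; the variable names (text_via_itertext, text_via_tostring, stripped_copy) are made up for illustration.

```python
import copy

from lxml import etree

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag>"
     " and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")
parent_tag = etree.fromstring(s)

# Option 1: join every text node under the element, in document order.
text_via_itertext = "".join(parent_tag.itertext())

# Option 2: serialize with the "text" output method, which drops all markup.
text_via_tostring = etree.tostring(parent_tag, method="text", encoding="unicode")

# Option 3: strip the tags on a deep copy so the original keeps its children.
stripped_copy = copy.deepcopy(parent_tag)
etree.strip_tags(stripped_copy, "*")

print(text_via_itertext)
print(len(parent_tag))     # 3: <i>, <some_tag> and <tt> are still there
print(len(stripped_copy))  # 0: the copy has no child elements left
```

Options 1 and 2 only read from the tree, so they are the lighter choice when you just need the string; the deep copy is useful if you still want an element back.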
5
You can use the lxml.html.clean module, which is meant for cleaning up HTML content:
# On lxml 5.2 and later, this import needs the separate lxml_html_clean package installed.
import lxml.html, lxml.html.clean
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding="unicode"))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>