How do I select and update text nodes in mixed content using lxml?

Posted 2024-04-27 21:57:20


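I need to check every word in every text() node of an XML file. I use the XPath //text() to select the text nodes and a regex to select the words. If a word is in a set of keywords, I need to replace it with something and update the XML.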

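Setting an element's text is normally done with .text, but .text on an _Element only changes the first child text node. In a mixed content element, the other text nodes are actually the .tail of their preceding sibling.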
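To make the .text / .tail split concrete, here is a minimal sketch. The fragment and its element names (p, b) are hypothetical and used only for illustration; they are not part of the XML in question.

```python
from lxml import etree

# Hypothetical mixed-content fragment, used only for illustration.
p = etree.fromstring("<p>before <b>bold</b> after</p>")
b = p[0]

print(repr(p.text))  # text before the first child element
print(repr(b.text))  # text inside <b>
print(repr(b.tail))  # text after </b>: stored on <b>, not on <p>

# Assigning p.text replaces only the leading text node;
# the text after </b> is untouched because it lives in b.tail.
p.text = "BEFORE "
result = etree.tostring(p, encoding="unicode")
print(result)
```

So updating every text node means touching both .text and .tail of the right elements.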

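How do I update all of the text nodes?

In the simplified example below, I am just trying to wrap the matching keywords in square brackets.

Input XML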

<doc>
    <para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
        and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>

Desired output

<doc>
    <para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
        and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>

1 answer
User
#1 · Posted 2024-04-27 21:57:20

I found the key to this solution in the lxml documentation, in the section "Using XPath to find text".

Specifically, the is_text and is_tail properties of _ElementUnicodeResult.

Using these properties, I can tell whether I need to update the .text or the .tail property of the parent _Element.

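This is a little tricky to understand at first, because when you call getparent() on a text node (_ElementUnicodeResult) that is a tail (is_tail is True), what is returned as the parent is the preceding sibling that owns the tail, not the actual parent element.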
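The getparent() behaviour can be checked with a short sketch. The fragment below is hypothetical (element names p and b are made up for illustration); it collects what each text node reports about itself.

```python
from lxml import etree

# Hypothetical mixed-content fragment, used only for illustration.
p = etree.fromstring("<p>before <b>bold</b> after</p>")

rows = []
for node in p.xpath("//text()"):
    owner = node.getparent()  # element whose .text or .tail holds this string
    rows.append((str(node), owner.tag, node.is_text, node.is_tail))
    print(rows[-1])
```

The last text node, " after", reports b as its parent with is_tail set, even though it sits inside p in the document. That is why the update has to go to the .tail of the returned element.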

Example

Python

import re
from lxml import etree

xml = """<doc>
    <para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
        and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
"""


def update_text(match, word_list):
    if match in word_list:
        return f"[{match}]"
    else:
        return match


root = etree.fromstring(xml)

keywords = {"ipsum", "is", "the", "best", "problems", "mistakes"}

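# //text() returns "smart strings" that remember the element they belong to.
for text in root.xpath("//text()"):
    # For a tail node, getparent() is the preceding sibling that owns the tail.
    parent = text.getparent()
    updated_text = re.sub(r"\w+", lambda match: update_text(match.group(), keywords), text)
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text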

etree.dump(root)

Output (dumped to the console)

<doc>
    <para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
        and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
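As a possible alternative, not the approach used in the answer above, the same result can be sketched by walking the elements and rewriting each element's .text and .tail directly, without XPath. The helper names bracket_keywords and mark_keywords and the shortened sample document are hypothetical.

```python
import re
from lxml import etree


def bracket_keywords(text, keywords):
    """Wrap each keyword found in text in square brackets; pass None through."""
    if text is None:
        return None
    return re.sub(
        r"\w+",
        lambda m: f"[{m.group()}]" if m.group() in keywords else m.group(),
        text,
    )


def mark_keywords(root, keywords):
    # tag=etree.Element restricts the walk to real elements, so comments
    # and processing instructions are skipped (and so are their tails).
    for elem in root.iter(tag=etree.Element):
        elem.text = bracket_keywords(elem.text, keywords)
        elem.tail = bracket_keywords(elem.tail, keywords)
    return root


# Shortened, hypothetical sample in the shape of the question's XML.
root = etree.fromstring(
    "<doc><para>the card <gotcha>is the</gotcha> best card</para></doc>"
)
mark_keywords(root, {"is", "the", "best"})
result = etree.tostring(root, encoding="unicode")
print(result)
```

One limitation of this sketch: text that follows a comment or processing instruction is stored in that node's tail and is not visited here, whereas the //text() approach does reach it.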
