How do I remove XML tags using regular expressions in Python?

Posted 2024-06-16 11:29:45


A string in Python can contain plain text, and it can also contain XML tags that carry specific information. For example:

The student XYZ abc has been terminated from the institute. 
you can find the details of student below:
<info StatusCode="End">
    <user_detail>
        <name>
            <first_name>ABC</first_name>
            <last_name>XYZ</last_name>
        </name>
        <contact_details>
            <contact_number>
                <number_type>landline</number_type>
                <number>1234567</number>
            </contact_number>
            <address>
                <address_field1> lorem ipsum, qwerty </address_field1>
                <address_field2> lorem ipsum2, qwerty2 </address_field2>
                <city> asdfgh </city>
                <state> zxcvbn </state>
                <country> India </country>
            </address>
        </contact_details>
    </user_detail>
    <flight_detail>
        ...
    </flight_detail>
</info>
Lorem ipsum dolor sit amet, pro ea dicat velit regione, modo putant 
sensibus pri id, ut bonorum scripserit sit. Ex nec tation alienum, est ut 
nemore efficiendi interpretaris, vis te reque eleifend. 
<xml_tag>
...
</xml_tag>
Laudem delectus
reprehendunt ei mei, has nisl dolorem mnesarchum no, ad eos modo singulis
euripidis. Quo no consul offendit. Eu alia utroque argumentum vix, no 
case primis eum.
<xml_tag>
....
</xml_tag>

The opening XML tag here is <info>, but it could take any form, such as <session StatusCode="End">, in which case the closing tag would be </session>. Currently, I am using:

[code snippet missing from the source]

However, I now want to remove all XML content from the text. The final output I want is:

The student XYZ abc has been terminated from the institute. 
you can find the details of student below:
Lorem ipsum dolor sit amet, pro ea dicat velit regione, modo putant 
sensibus pri id, ut bonorum scripserit sit. Ex nec tation alienum, est ut 
nemore efficiendi interpretaris, vis te reque eleifend. 
Laudem delectus
reprehendunt ei mei, has nisl dolorem mnesarchum no, ad eos modo singulis
euripidis. Quo no consul offendit. Eu alia utroque argumentum vix, no 
case primis eum. 

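I tried matching with </\S+>, but that removes everything up to the first closing XML tag. How can I remove all XML content from a string that may also contain plain text?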
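
One possible approach, offered as a minimal sketch rather than a definitive answer, is to match each top-level block from its opening tag to the closing tag with the same name by using a backreference, and to let `.` span newlines with `re.DOTALL`. The names `XML_BLOCK` and `strip_xml_blocks` are hypothetical, chosen only for illustration. The sketch assumes tag names consist of word characters, that a block never contains a nested element with the same name as its outermost tag, and that there are no self-closing tags; for input outside those assumptions a real XML parser would likely be more reliable.

```python
import re

# Hypothetical pattern for illustration:
#   [ \t]*            optional indentation before the opening tag
#   <(\w+)\b[^>]*>    opening tag; the tag name is captured as group 1
#   .*?               block body, as short as possible (DOTALL lets it span lines)
#   </\1>             closing tag with the same name as the opening tag
#   [ \t]*\n?         trailing spaces and at most one newline after the block
XML_BLOCK = re.compile(r"[ \t]*<(\w+)\b[^>]*>.*?</\1>[ \t]*\n?", re.DOTALL)


def strip_xml_blocks(text: str) -> str:
    """Return text with every top-level <tag ...>...</tag> block removed."""
    return XML_BLOCK.sub("", text)


# Shortened stand-in for the sample text in the question.
sample = (
    "The student XYZ abc has been terminated from the institute.\n"
    '<info StatusCode="End">\n'
    "    <name>\n"
    "        <first_name>ABC</first_name>\n"
    "    </name>\n"
    "</info>\n"
    "Lorem ipsum dolor sit amet.\n"
    "<xml_tag>\n"
    "...\n"
    "</xml_tag>\n"
    "Laudem delectus\n"
)

print(strip_xml_blocks(sample))
```

Because the closing tag must repeat the captured name, the inner tags such as </first_name> do not end the match early, which is the problem seen with </\S+>.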

