XML parser in Python that does not strip out tags

Posted 2024-04-26 00:13:52


I'm working on an XML parser for my project, but I'm stuck on one problem.

This is my XML file. I'm interested in a few elements: the sentence, the sentence's certainty attribute, and the ccue.

As output I'd like: the certainty (certain or uncertain), the ccue (which sits inside the sentence tag), and the whole sentence (with or without the ccue).

What I've done so far:

import xml.etree.ElementTree as ET

with open('myfile.xml', 'rt') as f:
    tree = ET.parse(f)

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty')
    ccue = sentence.find('ccue')
    if certainty and (ccue is not None):
       print('  %s :: %s :: %s' % (certainty, sentence.text, ccue.text))
    else:
       print('  %s ::,:: %s' % (certainty,sentence.text))

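But in this case the ccue is cut out of the sentence, so an uncertain sentence comes out incomplete. The sentence text seems to stop as soon as the ccue element is reached. So if the sentence is: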

<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>

it shows me only "However, the" as the sentence.

Can anyone help me solve this? If you could also help me save the results to a CSV file, that would be great.

Update: XML sample:

<sentence certainty="certain" id="S1867.2">Left-wing Israelis are open to compromise on the issue, by means such as the monetary reparations and family reunification initiatives offered by Ehud Barak at the Camp David 2000 summit.</sentence>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
<sentence certainty="certain" id="S1867.4">The HonestReporting organization listed the following grounds for this opposition: Palestinian flight from Israel was not compelled, but voluntary.</sentence>
<sentence certainty="uncertain" id="S1867.5">After seven Arab nations declared war on Israel in 1948, <ccue>many Arab leaders</ccue> encouraged Palestinians to flee, in order to make it easier to rout the Jewish state.</sentence>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>

1 Answer
User
#1 · Posted 2024-04-26 00:13:52

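In XML, text can be split across many text() nodes. ElementTree has a call, itertext(), that finds all descendant text nodes so you can glue them back together. XML is ambiguous about how whitespace around text nodes should be treated (is it part of the real text, or just "pretty printing" decoration?). Your example has text <ccue>text</ccue> text (note that this leaves one space too many once the pieces are joined with spaces), so I'm stripping each piece and adding my own spaces. You can adjust that part as needed.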
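To make the text-node point concrete, here is a minimal sketch using one sentence from the question, parsed from a string rather than from myfile.xml. The variable names are only illustrative. It shows that the sentence is stored as three separate pieces, which is why sentence.text alone ends at "However, the".

```python
import xml.etree.ElementTree as ET

# One sentence from the question, parsed from a string instead of a file
sample = ('<sentence certainty="uncertain" id="S1867.3">However, the '
          '<ccue>majority of Israelis</ccue> find a comprehensive right of return '
          'for Palestinian refugees to be unacceptable.</sentence>')
sentence = ET.fromstring(sample)
ccue = sentence.find('ccue')

print(repr(sentence.text))  # text before the first child element
print(repr(ccue.text))      # text inside <ccue>
print(repr(ccue.tail))      # text after </ccue>, up to the next tag

# itertext() walks all three pieces in document order;
# joining with '' keeps the whitespace exactly as it is in the file
full = ''.join(sentence.itertext())
print(full)
```

Joining with an empty string is the alternative to strip-and-rejoin when the spacing in the source file is already what you want.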

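import xml.etree.ElementTree as ET

# let ElementTree open the file and figure out the encoding
tree = ET.parse('myfile.xml')

for sentence in tree.iter('sentence'):
    certainty = sentence.attrib.get('certainty', '')
    ccue = sentence.find('ccue')
    # compare with None: a childless Element can test as false
    if certainty == "uncertain" and ccue is not None:
        # itertext() yields every text node in document order, including
        # the text inside <ccue> and the text that follows it
        text = ' '.join(node.strip() for node in sentence.itertext())
        print('  %s :: %s :: %s' % (certainty, text, ccue.text))
    else:
        print('  %s ::,:: %s' % (certainty, sentence.text))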
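For the CSV part of the question, one possible approach is the standard csv module. The sketch below is only an illustration under a few assumptions: the <root> wrapper element, the column names, the helper names sentence_rows and write_csv, and the output file name sentences.csv are all made up here, and the sample is parsed from a string so the example runs on its own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Two sentences from the question; <root> is an assumed wrapper element
SAMPLE = """<root>
<sentence certainty="certain" id="S1867.6">This point, however, is a matter of some contention.</sentence>
<sentence certainty="uncertain" id="S1867.3">However, the <ccue>majority of Israelis</ccue> find a comprehensive right of return for Palestinian refugees to be unacceptable.</sentence>
</root>"""


def sentence_rows(root):
    """Yield (certainty, ccue text, full sentence) for each <sentence>."""
    for sentence in root.iter('sentence'):
        certainty = sentence.get('certainty', '')
        ccue = sentence.find('ccue')
        cue_text = (ccue.text or '') if ccue is not None else ''
        # join every text node, then collapse runs of whitespace
        full = ' '.join(''.join(sentence.itertext()).split())
        yield certainty, cue_text, full


def write_csv(rows, out):
    """Write a header line followed by one CSV line per sentence."""
    writer = csv.writer(out)
    writer.writerow(['certainty', 'ccue', 'sentence'])
    writer.writerows(rows)


root = ET.fromstring(SAMPLE)
rows = list(sentence_rows(root))

# StringIO stands in for a real file here; for a file on disk use
# open('sentences.csv', 'w', newline='', encoding='utf-8') instead
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

The csv writer quotes any sentence that contains a comma, so the commas in the text do not break the columns.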
