Iterating over text and elements in lxml

Published 2024-06-01 00:01:29


Suppose I have the following XML document:

<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>

I then get the species element as follows:

^{pr2}$
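
The snippet for this step is not preserved on this page. As a minimal sketch, assuming the document is held in a string (the name `xml` here is hypothetical) and parsed with `etree.fromstring`, the element could be obtained like this. The asker's actual code may have differed.

```python
# Sketch only: one way to obtain the <species> element.
try:
    from lxml import etree
except ImportError:
    # The standard library offers the same fromstring() call.
    from xml.etree import ElementTree as etree

xml = """<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>"""

# <species> is the document root, so fromstring() returns it directly.
species = etree.fromstring(xml)

print(species.tag)
print([child.tag for child in species])
```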

Now I would like to print a list of animals grouped by species. How can I do this with the ElementTree API?


2 answers

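If you enumerate all the nodes, you will see a text node holding the class name, followed by the element nodes for the species in that class: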

>>> for node in species.xpath("child::node()"):
...     print type(node), node
... 
<class 'lxml.etree._ElementStringResult'> 
    Mammals: 
<type 'lxml.etree._Element'> <Element dog at 0xe0b3c0>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element cat at 0xe0b410>
<class 'lxml.etree._ElementStringResult'> 
    Reptiles: 
<type 'lxml.etree._Element'> <Element snake at 0xe0b460>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element turtle at 0xe0b4b0>
<class 'lxml.etree._ElementStringResult'> 
    Birds: 
<type 'lxml.etree._Element'> <Element seagull at 0xe0b500>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element owl at 0xe0b550>
<class 'lxml.etree._ElementStringResult'> 

So you can build it up from there:

^{pr2}$
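
A minimal sketch of what such a build step might look like, assuming lxml under Python 3, where XPath text results are `str` subclasses. The function name `group_by_heading` is hypothetical and this is not necessarily the answerer's code.

```python
from lxml import etree

xml = """<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>"""

species = etree.fromstring(xml)


def group_by_heading(parent):
    """Group child element tags under the text heading that precedes them."""
    groups = {}
    current = None
    for node in parent.xpath("child::node()"):
        if isinstance(node, str):
            # Text node: a non-blank one starts a new group.
            heading = node.strip().rstrip(":")
            if heading:
                current = groups.setdefault(heading, [])
        elif isinstance(node.tag, str) and current is not None:
            # Element node (comments and PIs have a non-string tag).
            current.append(node.tag)
    return groups


print(group_by_heading(species))
```

Whitespace-only text nodes, such as the one between `<dog/>` and `<cat/>`, are skipped, so only non-blank text starts a new group.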

Result:

{'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}

This is all fragile, though. Small changes in how the text nodes are arranged will throw off the parsing.

Design notes

@tdelaney's answer is basically correct, but I want to point out one nuance of the Python ElementTree API. Here is a quote from the ^{} tutorial:

Elements can contain text:

<root>TEXT</root>

In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.

However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.

The two properties text and tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
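
To make the quoted passage concrete, here is a small sketch that inspects `text` and `tail` on the tutorial's own example. It uses the standard library's `xml.etree.ElementTree`, which exposes the same two attributes as lxml. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

# Text before the first child belongs to the parent's .text.
print(repr(body.text))
# <br/> has no content of its own, so its .text is None.
print(repr(br.text))
# Text after <br/>, up to the next element or the parent's end tag, is its .tail.
print(repr(br.tail))
```

So "World" is stored on the `<br/>` element rather than on `<body>`, which is why no separate text node type is needed.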

Implementation

Given these properties, the document text can be retrieved without forcing the tree to output text nodes.

#!/usr/bin/env python3.3


import itertools
from pprint import pprint

try:
  from lxml import etree
except ImportError:
  # cElementTree was removed in Python 3.9; ElementTree is the replacement
  from xml.etree import ElementTree as etree


def textAndElement(node):
  '''In py33+ recursive generators are easy'''

  yield node

  text = node.text.strip() if node.text else None
  if text:
    yield text

  for child in node:
    yield from textAndElement(child)

  tail = node.tail.strip() if node.tail else None
  if tail:
    yield tail


if __name__ == '__main__':
  xml = '''
    <species>
      Mammals: <dog/> <cat/>
      Reptiles: <snake/> <turtle/>
      Birds: <seagull/> <owl/>
    </species>
  '''
  doc = etree.fromstring(xml)

  pprint(list(textAndElement(doc)))
  #[<Element species at 0x7f2c538727d0>,
  #'Mammals:',
  #<Element dog at 0x7f2c538728c0>,
  #<Element cat at 0x7f2c53872910>,
  #'Reptiles:',
  #<Element snake at 0x7f2c53872960>,
  #<Element turtle at 0x7f2c538729b0>,
  #'Birds:',
  #<Element seagull at 0x7f2c53872a00>,
  #<Element owl at 0x7f2c53872a50>]

  gen = textAndElement(doc)
  next(gen) # skip root
  groups = []
  for _, g in itertools.groupby(gen, type):
    groups.append(tuple(g))

  pprint(dict(zip(*[iter(groups)] * 2)) )
  #{('Birds:',): (<Element seagull at 0x7fc37f38aaa0>,
  #               <Element owl at 0x7fc37f38a820>),
  #('Mammals:',): (<Element dog at 0x7fc37f38a960>,
  #                <Element cat at 0x7fc37f38a9b0>),
  #('Reptiles:',): (<Element snake at 0x7fc37f38aa00>,
  #                <Element turtle at 0x7fc37f38aa50>)}
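
If plain tag names grouped under each heading are wanted instead of tuples of elements, the same `text`/`tail` idea can be applied to the direct children only. This is a sketch under that assumption. The name `group_children` is hypothetical, and the standard library parser is used so that it runs without lxml.

```python
import xml.etree.ElementTree as ET


def group_children(parent):
    """Map each text heading to the tags of the child elements that follow it."""
    groups = {}
    current = None

    def note_heading(text):
        # A non-blank piece of text starts a new group; blank text is ignored.
        nonlocal current
        heading = (text or "").strip().rstrip(":")
        if heading:
            current = groups.setdefault(heading, [])

    note_heading(parent.text)        # text before the first child
    for child in parent:
        if current is not None:
            current.append(child.tag)
        note_heading(child.tail)     # text after this child
    return groups


doc = ET.fromstring("""
<species>
  Mammals: <dog/> <cat/>
  Reptiles: <snake/> <turtle/>
  Birds: <seagull/> <owl/>
</species>
""")

print(group_children(doc))
```

Like the other approaches here, this relies on each heading appearing as text directly before its elements, so it is just as sensitive to changes in the document layout.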
