Python lxml: iterparse with nested elements

2 votes

1 answer

1623 views

Asked 2025-04-16 16:06

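I want to extract the content of a specific element from an XML file. However, that element contains other XML elements, which makes it hard to extract the parent tag's content correctly. For example: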

xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

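from io import BytesIO
from lxml import etree

context = etree.iterparse(BytesIO(xml.encode('utf-8')), events=('end',), tag='claim-text')
for event, element in context:
  print(element.text)  # only the text before the first child element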

This prints:

a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
None

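However, the text 'A protective uniform for use ..', for example, is missing. It looks as if every 'claim-text' element that contains other elements is skipped. How should I change the way I parse the XML so that I get the content of all the claims?

Thanks

I have just solved this with a 'plain' SAX-style parser approach: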

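class SimpleXMLHandler(object):
  """Parser target that collects the full text of each top-level claim-text."""

  def __init__(self):
    self.buffer = ''
    self.claim = 0  # nesting depth of the claim-text elements currently open

  def start(self, tag, attributes):
    if tag == 'claim-text':
      if self.claim == 0:
        self.buffer = ''  # a new top-level claim begins
      self.claim += 1

  def data(self, data):
    if self.claim > 0:
      self.buffer += data

  def end(self, tag):
    if tag == 'claim-text':
      self.claim -= 1
      if self.claim == 0:
        print(self.buffer)  # emit once the outermost claim closes

  def close(self):
    pass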
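For completeness, here is a minimal sketch of how a handler of this kind could be attached to a parser. The class name ClaimCollector, the shortened sample document, and the choice to return a list instead of printing are assumptions made for illustration. The import falls back to the standard library's ElementTree, which offers the same target interface, in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback: the standard library supports parser targets too
    import xml.etree.ElementTree as etree


class ClaimCollector(object):
    """Hypothetical parser target gathering the text of each top-level claim-text."""

    def __init__(self):
        self.claims = []  # finished claims, one string each
        self.parts = []   # text chunks of the claim currently being read
        self.depth = 0    # number of claim-text elements currently open

    def start(self, tag, attributes):
        if tag == 'claim-text':
            self.depth += 1

    def data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.depth > 0:
            self.parts.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                self.claims.append(''.join(self.parts))
                self.parts = []

    def close(self):
        # The value returned here is handed back by the parse call.
        claims, self.claims = self.claims, []
        return claims


# Shortened sample document (an assumption for illustration).
xml = b"<test><claim-text><b>2</b>. A uniform comprising: <claim-text>a. an upper garment</claim-text> <claim-text>b. panels</claim-text></claim-text></test>"

parser = etree.XMLParser(target=ClaimCollector())
claims = etree.fromstring(xml, parser)
print(claims)
```

Counting the nesting depth, rather than keeping a yes/no flag, keeps the text that follows a nested closing tag inside the same claim.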


1 Answer

You can use XPath to select the text nodes directly under each <claim-text> node and join them together, like this:

from io import BytesIO
from lxml import etree
xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Note: at a 'start' event lxml does not guarantee that an element's text and
# children are parsed yet; it works here because the document is tiny.
context = etree.iterparse(BytesIO(xml.encode('utf-8')), events=('start',), tag='claim-text')
for event, element in context:
  print(''.join(element.xpath('text()')))

This produces:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising:  
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
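Note that text() selects only the text nodes that are direct children of the element, which is why the "2" inside <b> does not appear in the first line above. If the text of all descendants is wanted, one possible variant is to wait for 'end' events, when the element is fully parsed, and join the strings from itertext(). The following is a sketch under assumptions: the helper name claim_texts and the shortened sample document are made up for illustration. The tag is checked by hand so that the same code also runs on the standard library's ElementTree (lxml's iterparse additionally accepts tag='claim-text').

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # fallback: the standard library offers the same calls
    import xml.etree.ElementTree as etree

# Shortened sample document (an assumption for illustration).
xml = b"<test><claim-text><b>2</b>. A uniform comprising: <claim-text>a. an upper garment</claim-text> <claim-text>b. panels</claim-text></claim-text></test>"


def claim_texts(source):
    """Yield the full text of every claim-text element, nested ones included."""
    # At an 'end' event the element and all of its children are complete.
    for _event, element in etree.iterparse(source, events=('end',)):
        if element.tag == 'claim-text':
            # itertext() walks the text and tail strings of all descendants
            # in document order.
            yield ''.join(element.itertext())


for text in claim_texts(BytesIO(xml)):
    print(text)
```

Because 'end' events fire when an element closes, the nested claims are yielded before the outer claim that contains them.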

Answered 2025-04-16 by Python大师
