Getting attributes of nested elements with lxml
I have a simple XML file that looks like this:
<brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW" >BMW</brandName>
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
I want to parse this file with lxml and pull out its values. For brandName, this is all it takes:
'brand_name' : m.findtext(NS+'brandName')
And if I want its abbrev attribute:
'brand_name' : m.findtext(NS+'brandName').attrib['abbrev']
For maxspeed, I can get the value like this:
'maxspeed_value' : m.findtext(NS+'maxspeed/value'),
or:
'maxspeed_value' : m.find(NS+'maxspeed/value').text,
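Now I want to get the abbrev attribute of the unit element inside maxspeed. I have tried many different approaches, but all of them failed. Most of the time the error is: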
'NoneType' object has no attribute 'attrib'
Here are some of the things I tried, none of which worked:
'maxspeed_unit' : m.find(NS+'maxspeed/value').attrib['abbrev'],
'maxspeed_unit' : (m.find(NS+'maxspeed/value')).get('abbrev'),
Can you give me some hints as to why this does not work? Thanks a lot!
Updated XML:
<Car xmlns="http://example.com/vocab/xml/cars#">
<dateStarted>2011-02-05</dateStarted>
<dateSold>2011-02-13</dateSold>
<name type="http://example.com/codes/bmw#" abbrev="X6" value="BMW X6" >BMW X6</name>
<brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW" >BMW</brandName>
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
<route type="http://example.com/codes/routes#" abbrev="HW" value="Highway" >Highway</route>
<power>
<value>180</value>
<unit type="http://example.com/codes/units#" value="powerhorse" abbrev="ph" />
</power>
<frequency type="http://example.com/codes/frequency#" value="daily" >Daily</frequency>
</Car>
2 Answers
0
import lxml.etree as ET
content='''
<Car xmlns="http://example.com/vocab/xml/cars#">
<brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW" >BMW</brandName>
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
</Car>
'''
doc=ET.fromstring(content)
NS = 'http://example.com/vocab/xml/cars#'
# print(ET.tostring(doc,pretty_print=True))
for x in doc.xpath('//ns:maxspeed/ns:unit/@abbrev',namespaces={'ns': NS}):
    print(x)
yields
mph
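As a possible explanation for the 'NoneType' error in the question, here is a minimal sketch. It assumes that NS in the question is the Clark-notation prefix '{http://example.com/vocab/xml/cars#}' (the question never shows its definition), and the names missing, unit and ns_map are only illustrative. In a path such as NS+'maxspeed/value', only the first step carries the namespace, and value has no abbrev attribute in any case.

```python
import lxml.etree as ET

content = '''
<Car xmlns="http://example.com/vocab/xml/cars#">
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
</Car>
'''
doc = ET.fromstring(content)

# Assumption: this is how NS was defined in the question.
NS = '{http://example.com/vocab/xml/cars#}'

# Only the first step is namespace-qualified, so the second step
# ('value' without a namespace) matches nothing and find() returns None.
missing = doc.find(NS + 'maxspeed/value')

# Qualify every step, and target <unit>, the element that carries abbrev.
unit = doc.find(NS + 'maxspeed/' + NS + 'unit')
abbrev = unit.get('abbrev')

# The same lookup with a prefix mapping instead of Clark notation.
ns_map = {'c': 'http://example.com/vocab/xml/cars#'}
abbrev_via_prefix = doc.find('c:maxspeed/c:unit', namespaces=ns_map).get('abbrev')

print(missing, abbrev, abbrev_via_prefix)
```

Calling .attrib on the None result of the first lookup is what would raise "'NoneType' object has no attribute 'attrib'".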
0
On an lxml element, .find called with a bare tag name only searches that element's direct children. For example, in the following XML:
<root>
<brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW">BMW</brandName>
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
</root>
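you can call .find on the root element to locate the brandName or maxspeed element, but a bare tag name will not match anything nested inside those elements.
So you can do this: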
root.find('maxspeed').find('unit') #returns the unit Element
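From the returned element you can access its attributes.
If you want to search the whole XML document for elements, you can use the .iter() method. For the previous example, you could write: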
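A small sketch of that lookup on the un-namespaced example, for illustration only. It also shows that .find accepts a simple path such as 'maxspeed/unit', so the chained call can be collapsed into one. The attribute name symbol is hypothetical and is there only to show .get with a default.

```python
import lxml.etree as ET

root = ET.fromstring('''
<root>
<brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW">BMW</brandName>
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
</root>
''')

# A bare tag name only matches direct children, so this finds nothing.
not_found = root.find('unit')

# Two equivalent ways to reach the nested element.
unit_chained = root.find('maxspeed').find('unit')
unit_by_path = root.find('maxspeed/unit')

# .attrib behaves like a dict and raises KeyError for a missing key;
# .get returns a default instead.
abbrev = unit_chained.attrib['abbrev']
symbol = unit_by_path.get('symbol', 'n/a')  # 'symbol' is a hypothetical attribute

# Attributes on an element that also has text work the same way.
brand_abbrev = root.find('brandName').get('abbrev')

print(not_found, abbrev, symbol, brand_abbrev)
```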
for element in root.iter(tag='unit'):
    print(element) #This would print all the unit elements in the document.
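With the namespaced XML from the question, the tag passed to .iter() has to be fully qualified. The following is a rough sketch under that assumption; the trimmed-down document and the names NS_URI and units are made up for illustration.

```python
import lxml.etree as ET

NS_URI = 'http://example.com/vocab/xml/cars#'

doc = ET.fromstring('''
<Car xmlns="http://example.com/vocab/xml/cars#">
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
<power>
<value>180</value>
<unit type="http://example.com/codes/units#" value="powerhorse" abbrev="ph" />
</power>
</Car>
''')

# iter() walks the whole subtree, so both <unit> elements are found
# no matter how deeply they are nested.
units = {}
for unit in doc.iter('{%s}unit' % NS_URI):
    # QName strips the namespace, leaving the local name of the parent tag.
    parent_name = ET.QName(unit.getparent()).localname
    units[parent_name] = unit.get('abbrev')

print(units)
```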
Edit: here is a small, fully working example that uses the XML you provided:
import lxml.etree
from io import StringIO

def ns_join(element, tag, namespace=None):
    '''Joins the namespace and tag together, and
    returns the fully qualified name.
    @param element - The lxml.etree._Element you're searching
    @param tag - The tag you're joining
    @param namespace - (optional) The Namespace shortname default is None'''
    return '{%s}%s' % (element.nsmap[namespace], tag)

def parse_car(element):
    '''Parse a car element, This will return a dictionary containing
    brand_name, maxspeed_value, and maxspeed_unit'''
    maxspeed = element.find(ns_join(element,'maxspeed'))
    return {
        'brand_name' : element.findtext(ns_join(element,'brandName')),
        'maxspeed_value' : maxspeed.findtext(ns_join(maxspeed,'value')),
        'maxspeed_unit' : maxspeed.find(ns_join(maxspeed, 'unit')).attrib['abbrev']
    }

#Create the StringIO object to feed to the parser.
XML = StringIO('''
<Reports>
<Car xmlns="http://example.com/vocab/xml/cars#">
<dateStarted>2011-02-05</dateStarted>
<dateSold>2011-02-13</dateSold>
<name type="http://example.com/codes/bmw#" abbrev="X6" value="BMW X6" >BMW X6</name>
<brandName type="http://example.com/codes/bmw#" abbrev="BMW" value="BMW" >BMW</brandName>
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
<route type="http://example.com/codes/routes#" abbrev="HW" value="Highway" >Highway</route>
<power>
<value>180</value>
<unit type="http://example.com/codes/units#" value="powerhorse" abbrev="ph" />
</power>
<frequency type="http://example.com/codes/frequency#" value="daily" >Daily</frequency>
</Car>
</Reports>
''')

#Get the root element object of the xml
car_root_element = lxml.etree.parse(XML).getroot()

# For each 'Car' tag in the root element,
# we want to parse it and save the list as cars
cars = [ parse_car(element)
         for element in car_root_element.iter() if element.tag.endswith('Car')]
print(cars)
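If some cars might lack a maxspeed or unit element, the chained lookups in parse_car would raise the same 'NoneType' error as in the question. One way to guard against that is sketched below. The helper attr_or_default, the NS_MAP prefix mapping, and the torque element are all hypothetical names introduced for illustration and are not part of lxml or of the question.

```python
import lxml.etree as ET

# Assumed prefix mapping for the namespace used in the question.
NS_MAP = {'c': 'http://example.com/vocab/xml/cars#'}

def attr_or_default(element, path, attr, default=None):
    '''Hypothetical helper: return attribute attr of the first element
    matching path, or default if the element or attribute is missing.'''
    found = element.find(path, namespaces=NS_MAP)
    if found is None:
        return default
    return found.get(attr, default)

car = ET.fromstring('''
<Car xmlns="http://example.com/vocab/xml/cars#">
<maxspeed>
<value>250</value>
<unit type="http://example.com/codes/units#" value="miles per hour" abbrev="mph" />
</maxspeed>
</Car>
''')

maxspeed_unit = attr_or_default(car, 'c:maxspeed/c:unit', 'abbrev')
# <torque> does not exist in this document, so the default is returned.
torque_unit = attr_or_default(car, 'c:torque/c:unit', 'abbrev', default='unknown')

print(maxspeed_unit, torque_unit)
```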
Hope this helps.