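Getting the text of a single element from an XML etree
The code below works fine, but is there a more Pythonic way to do the same thing? I just want to parse the XML and get the text of a few elements (name, status, and URL).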
from lxml import etree
from urllib2 import urlopen
def ask_CoL(url):
    tree = etree.parse(urlopen(url))
    tn = [el.get('total_number_of_results') for el in tree.iter('results')]
    try:
        nr = int(tn[0])
    except ValueError:
        nr = 0
    if nr == 1:
        newstr = str([el.text for el in tree.getiterator(tag='name')])\
            .strip("[]'")+','\
            +str([el.text for el in tree.getiterator(tag='name_status')])\
            .strip("[]'")+','\
            +str([el.text for el in tree.getiterator(tag='url')])\
            .strip("[]'")+'\n'
    else:
        newstr = 'NA\n'
    return newstr
Sample XML:
<results id="" name="Theragra chalcogramma" total_number_of_results="1" number_of_results_returned="1" start="0" error_message="" version="1.6 rev 1152">
<result>
<id>9037795</id>
<name>Theragra chalcogramma</name>
<rank>Species</rank>
<name_status>accepted name</name_status>
<online_resource>http://www.fishbase.org/Summary/SpeciesSummary.php?ID=318</online_resource>
<source_database>FishBase</source_database>
<source_database_url>http://www.fishbase.org</source_database_url>
<name_html><i>Theragra chalcogramma</i> (Pallas, 1814)</name_html>
<url>http://www.catalogueoflife.org/col/details/species/id/9037795</url>
</result>
</results>
1 Answer
You can simplify both the interface and the implementation:
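import urllib2
from xml.etree import cElementTree as etree
def f(url):
    tree = etree.parse(urllib2.urlopen(url))
    # <results> is the root element, so look for its first <result> child
    el = tree.find('result')
    if el is not None:
        # findtext returns None for a missing tag; fall back to ''
        lst = [el.findtext(tag) or '' for tag in "name name_status url".split()]
        return ','.join(lst)
    # falls through and returns None when there is no <result> element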
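If you are on Python 3 (where urllib2 and cElementTree no longer exist), the same idea might look roughly like the sketch below. It parses from a string so it can be run without network access. The names extract_fields, FIELDS and SAMPLE are made up for this illustration, and SAMPLE is a trimmed-down copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical default: the three fields the question asks for.
FIELDS = ("name", "name_status", "url")


def extract_fields(xml_text, fields=FIELDS):
    """Return the texts of the given child tags of the first <result>, comma-joined.

    Returns None when the document contains no <result> element.
    """
    root = ET.fromstring(xml_text)   # root is the <results> element
    result = root.find("result")     # first direct <result> child, or None
    if result is None:
        return None
    # findtext returns None for a missing tag; substitute an empty string
    return ",".join(result.findtext(tag) or "" for tag in fields)


# Trimmed-down version of the sample XML from the question.
SAMPLE = """<results total_number_of_results="1">
  <result>
    <name>Theragra chalcogramma</name>
    <name_status>accepted name</name_status>
    <url>http://www.catalogueoflife.org/col/details/species/id/9037795</url>
  </result>
</results>"""

print(extract_fields(SAMPLE))
```

To use it against a live URL, you could presumably pass in the response body, e.g. extract_fields(urllib.request.urlopen(url).read()). Returning None instead of 'NA\n' leaves it to the caller to decide how a missing result should be written out.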