Getting the text of a single element from an XML etree

Asked 2025-04-17 08:53

The code below works fine, but is there a more Pythonic way to do the same thing? I just want to parse the XML and get the text of a few elements (name, status, and URL).

from lxml import etree
from urllib2 import urlopen

def ask_CoL(url):
    tree = etree.parse(urlopen(url))
    tn=[ el.get('total_number_of_results') for el in tree.iter('results') ]
    try:
        nr = int(tn[0])
    except ValueError:
        nr = 0
    if nr == 1:
        newstr = str([ el.text for el in tree.getiterator(tag='name')])\
                                             .strip("[]'")+','\
                +str([ el.text for el in tree.getiterator(tag='name_status')])\
                                             .strip("[]'")+','\
                +str([ el.text for el in tree.getiterator(tag='url')])\
                                             .strip("[]'")+'\n'
    else:
        newstr = 'NA\n'
    return newstr

Sample XML:

<results id="" name="Theragra chalcogramma" total_number_of_results="1" number_of_results_returned="1" start="0" error_message="" version="1.6 rev 1152">
  <result>
    <id>9037795</id>
    <name>Theragra chalcogramma</name>
    <rank>Species</rank>
    <name_status>accepted name</name_status>
    <online_resource>http://www.fishbase.org/Summary/SpeciesSummary.php?ID=318</online_resource>
    <source_database>FishBase</source_database>
    <source_database_url>http://www.fishbase.org</source_database_url>
    <name_html><i>Theragra chalcogramma</i> (Pallas, 1814)</name_html>
    <url>http://www.catalogueoflife.org/col/details/species/id/9037795</url>
  </result>
</results>

1 Answer

You can simplify both the interface and the implementation:

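import urllib2
from xml.etree import cElementTree as etree

def f(url):
    tree = etree.parse(urllib2.urlopen(url))
    # <results> is the root element, so search its children for <result>
    el = tree.find('result')
    if el is not None:
        # findtext() returns the text of the first matching child, or None
        lst = [el.findtext(tag) or '' for tag in "name name_status url".split()]
        return ','.join(lst)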
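
For anyone on Python 3, where `urllib2` and `cElementTree` are no longer available, the same idea might look roughly like the sketch below. It parses a trimmed copy of the sample XML from a string instead of fetching a URL, so it can be run as-is. The names `SAMPLE`, `FIELDS`, and `summarize` are made up for illustration, and it assumes you still want `NA` unless exactly one result is reported, as in the question's code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response from the question (assumption: the
# real response has this shape, with <results> as the root element).
SAMPLE = """<results id="" name="Theragra chalcogramma" total_number_of_results="1" number_of_results_returned="1" start="0" error_message="" version="1.6 rev 1152">
  <result>
    <id>9037795</id>
    <name>Theragra chalcogramma</name>
    <rank>Species</rank>
    <name_status>accepted name</name_status>
    <url>http://www.catalogueoflife.org/col/details/species/id/9037795</url>
  </result>
</results>"""

# Child elements of <result> whose text we want, in output order.
FIELDS = ("name", "name_status", "url")


def summarize(root):
    """Return 'name,name_status,url' if exactly one result is reported, else 'NA'."""
    try:
        total = int(root.get("total_number_of_results", ""))
    except ValueError:
        # Missing or non-numeric attribute: treat as zero results.
        total = 0
    result = root.find("result")
    if total != 1 or result is None:
        return "NA"
    # findtext() returns the default when a child element is absent.
    return ",".join(result.findtext(tag, default="") for tag in FIELDS)


root = ET.fromstring(SAMPLE)
print(summarize(root))
```

To read from the web service instead, something like `ET.parse(urllib.request.urlopen(url)).getroot()` should give you the same kind of root element. Unlike the code in the question, this sketch leaves the trailing newline to the caller.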
