Python ElementTree: find() with a wildcard?

3 votes
1 answer
4560 views
Asked 2025-04-17 13:51

I'm parsing an XML feed in Python to extract certain tags. The XML uses namespaces, so every tag comes back prefixed with its namespace URI, followed by the tag name.

Here is the XML:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:rte="http://www.rte.ie/schemas/vod">
    <id>10038711/</id>
    <updated>2013-01-24T22:52:43+00:00</updated>
    <title type="text">Reeling in the Years</title>
    <logo>http://www.rte.ie/iptv/images/logo.gif</logo>
    <link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist?type=iptv&amp;showId=10038711" />
    <category term="feed"/>
    <author>
        <name>RTE</name>
        <uri>http://www.rte.ie</uri>
    </author>
    <entry>
        <id>10038711</id>
        <published>2012-07-04T12:00:00+01:00</published>
        <updated>2013-01-06T12:31:25+00:00</updated>
        <title type="text">Reeling in the Years</title>
        <content type="text">National and international events with popular music from the year 1989.First Broadcast: 08/11/1999</content>
        <category term="WEB Exclusive" rte:type="channel"/>
        <category term="Classics 1980" rte:type="genre"/>
        <category term="rte player" rte:type="source"/>
        <category term="" rte:type="transmision_details"/>
        <category term="False" rte:type="copyprotectionoptout"/>
        <category term="long" rte:type="form"/>
        <category term="3275" rte:type="progid"/>
        <link rel="site" type="text/html" href="http://www.rte.ie/tv50/"/>
        <link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist/?itemId=10038711&amp;type=iptv&amp;format=xml" />
        <link rel="alternate" type="text/html" href="http://www.rte.ie/player/#v=10038711"/>
        <rte:valid start="2012-07-23T15:56:04+01:00" end="2017-08-01T15:56:04+01:00"/>
        <rte:duration ms="842205" formatted="0:10"/>
        <rte:statistics views="19"/>
        <rte:bri id="na"/>
        <rte:channel id="13"/>
        <rte:item id="10038711"/>
        <media:title type="plain">Reeling in the Years</media:title>
        <media:description type="plain">National and international events with popular music from the year 1989. First Broadcast: 08/11/1999</media:description>
        <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg" height="288" width="512" time="00:00:00+00:00"/>
        <media:teaserimgref1x1 url="" time="00:00:00+00:00"/>
        <media:rating scheme="http://www.rte.ie/schemes/vod">NA</media:rating>
        <media:copyright>RTÉ</media:copyright>
        <media:group rte:format="single">
            <media:content url="http://vod.hds.rasset.ie/manifest/2012/0728/20120728_reelingint_cl10038711_10039316_260_.f4m" type="video/mp4" medium="video" expression="full" duration="842205" rte:format="content"/>
        </media:group>
        <rte:ads>
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre2&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre3&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
        </rte:ads>
    </entry>
<!-- playlist.xml -->
</feed>

When I parse this XML, the tag of each element comes out like this:

{http://www.w3.org/2005/Atom}id
{http://www.w3.org/2005/Atom}published
{http://www.w3.org/2005/Atom}updated
.....
.....
{http://www.rte.ie/schemas/vod}valid
{http://www.rte.ie/schemas/vod}duration
....
....
{http://search.yahoo.com/mrss/}description
{http://search.yahoo.com/mrss/}thumbnail
....

Because there are three different namespaces and I can't guarantee they will always stay the same, I'd prefer not to hard-code each tag like this:

for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    stream = str(elem.find('{http://www.w3.org/2005/Atom}id').text)
    date_tmp = str(elem.find('{http://www.w3.org/2005/Atom}published').text)
    name_tmp = str(elem.find('{http://www.w3.org/2005/Atom}title').text)
    short_tmp = str(elem.find('{http://www.w3.org/2005/Atom}content').text)
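    # find()'s second argument is a namespace map, so select the "channel" category with an attribute predicate instead
    channel_tmp = elem.find("{http://www.w3.org/2005/Atom}category[@{http://www.rte.ie/schemas/vod}type='channel']")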
    channel = str(channel_tmp.get('term'))
    icon_tmp = elem.find('{http://search.yahoo.com/mrss/}thumbnail')
    icon_url = str(icon_tmp.get('url'))

Is there a way to use a wildcard or something similar in the lookup so that the namespace is ignored? For example:

stream = str(elem.find('*id').text)

I could hard-code them as above, but I'm worried that if the namespaces change in the future, my queries will stop returning data.

Thanks for your help.

1 Answer

3

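You can use an XPath expression containing the local-name() function, which compares only the tag name and ignores the namespace (this needs lxml; the standard library's ElementTree does not support XPath functions). Given this XML: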

<?xml version="1.0"?>
<root xmlns="ns">
  <tag/>
</root>

Assuming "doc" is the element tree of the XML above:

import lxml.etree
doc = lxml.etree.parse(<some_file_like_object>)
root = doc.getroot()
root.xpath('//*[local-name()="tag"]')
[<Element {ns}tag at 0x7fcde6f7c960>]

Replace <some_file_like_object> as appropriate (alternatively, use lxml.etree.fromstring with an XML string to get the root element directly).
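If lxml is not an option, the standard library can get close on its own: since Python 3.8, the path syntax accepted by find(), findall() and iterfind() understands "{*}name", meaning "name in any namespace". The sketch below is not part of the original answer. It uses a trimmed-down, hypothetical stand-in for the feed in the question, and the local_name helper is an illustrative name, not a library function. Note that the wildcard applies to element tags only, so the rte:type attribute is matched by comparing local names by hand.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical stand-in for the feed shown in the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:rte="http://www.rte.ie/schemas/vod">
  <entry>
    <id>10038711</id>
    <title type="text">Reeling in the Years</title>
    <category term="WEB Exclusive" rte:type="channel"/>
    <category term="Classics 1980" rte:type="genre"/>
    <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
  </entry>
</feed>"""


def local_name(qname):
    """Return the part of a '{uri}name' string after the namespace (illustrative helper)."""
    return qname.rsplit("}", 1)[-1]


root = ET.fromstring(FEED)

results = []
# "{*}entry" matches <entry> in any namespace (requires Python 3.8+).
for entry in root.iterfind(".//{*}entry"):
    stream = entry.find("{*}id").text
    name = entry.find("{*}title").text
    icon_url = entry.find("{*}thumbnail").get("url")
    # Attribute names have no wildcard form, so compare their local names manually.
    channel = next(
        cat.get("term")
        for cat in entry.findall("{*}category")
        if any(local_name(key) == "type" and value == "channel"
               for key, value in cat.attrib.items())
    )
    results.append((stream, name, channel, icon_url))

print(results)
```

One caveat with either approach: ignoring the namespace means two elements with the same local name in different namespaces (for example the Atom title and media:title in the full feed) become indistinguishable, so the first match in document order wins.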
