Parsing airnow.g XML data in Python

import system
import xml.dom.minidom

url = "http://feeds.enviroflash.info/rss/realtime/133.xml"
response = system.net.httpGet(url)
dom = xml.dom.minidom.parseString(response)
for tag in dom.getElementsByTagName("*"):
    print tag.firstChild.data

<rss version="2.0">
  <channel>
    <title>San Francisco, CA - Current Air Quality</title>
    <link>http://www.airnow.gov/</link>
    <description>EnviroFlash RSS Feed</description>
    <language>en-us</language>
    <webMaster> airnowdmc@sonomatech.com (AIRNow Data Management Center) </webMaster>
    <pubDate>Thu, 12 Oct 2017 08:45:10 PDT</pubDate>
    <item>
      <title>San Francisco, CA - Current Air Quality</title>
      <link> http://feeds.enviroflash.info/rss/realtime/133.xml?id=AC9AF12B-02F4-5A9E-BD504999C6EF606E </link>
      <description>
        <div xmlns="http://www.w3.org/1999/xhtml">
          <table style="width: 350px;">
            <tr>
              <td> <br> </td>
            </tr>
            <tr>
              <td valign="top">
                <div><b>Location:</b> San Francisco, CA</div><br />
                <div>
                  <b>Current Air Quality:</b> 10/12/17 8:00 AM PDT<br /><br />
                  <div>
                    Unhealthy - 156 AQI - Particle Pollution (2.5 microns)<br />
                    <br />
                    Good - 1 AQI - Ozone<br />
                    <br />
                  </div>
                </div>
                <div><b>Agency:</b> San Francisco Bay Area AQMD </div><br />
                <div><i>Last Update: Thu, 12 Oct 2017 08:45:10 PDT</i></div>
              </td>
            </tr>
          </table>
        </div>
      </description>
    </item>
  </channel>
</rss>

San Francisco, CA - Current Air Quality
http://www.airnow.gov/
EnviroFlash RSS Feed
en-us
airnowdmc@sonomatech.com (AIRNow Data Management Center)
Thu, 12 Oct 2017 08:45:10 PDT
San Francisco, CA - Current Air Quality
http://feeds.enviroflash.info/rss/realtime/133.xml?id=AC9AF12B-02F4-5A9E-BD504999C6EF606E

1 answer

User

#1 · Posted on 2024-04-24 23:12:47

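First of all, HTML is not XML, so consider using BeautifulSoup to do the same thing in a similar fashion. For example, <br> is a valid tag in HTML without any matching closing tag, but an XML parser will throw an error on it.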
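To make the point about <br> concrete, here is a minimal sketch using only the standard library. The helper name `is_well_formed_xml` and the two sample strings are made up for illustration; they are not taken from the feed.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed_xml(markup):
    """Return True if minidom can parse `markup`, False if expat rejects it."""
    try:
        xml.dom.minidom.parseString(markup)
        return True
    except ExpatError:
        return False


# Hypothetical snippets: the first uses an HTML-style <br> with no closing tag,
# the second uses the self-closing form that XML requires.
html_style = "<div>Ozone<br></div>"
xml_style = "<div>Ozone<br /></div>"

print(is_well_formed_xml(html_style))  # False
print(is_well_formed_xml(xml_style))   # True
```

An HTML-aware parser such as BeautifulSoup tolerates the first form, which is why it tends to be the safer choice for markup like this.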

That said, take a look at the following:

# Prints all the text in the document (what your code was attempting).
for tag in dom.getElementsByTagName("*"):
    # Skip empty elements and elements whose first child is another element.
    if tag.firstChild and not isinstance(tag.firstChild, xml.dom.minidom.Element):
        if len(tag.firstChild.data.strip()) > 0:
            print(tag.firstChild.wholeText)
print('\n\n\n')
# Prints the text from just the second <description> element.
# I believe all the parts here are important: time, place, last update, etc.
desc = dom.getElementsByTagName("description")[1]
for tag in desc.getElementsByTagName("*"):
    for node in tag.childNodes:
        if isinstance(node, xml.dom.minidom.Text) and len(node.data.strip()) > 0:
            print(node.data)

Hopefully you can figure out how to get "Location: San Francisco, CA" rather than "San Francisco, CA Location:".
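One possible way to approach that ordering problem is to walk each element's children recursively, so that text is collected in document order instead of parent text first. This is only a sketch: the helper `text_in_document_order` and the shortened `sample` string are assumptions made for illustration, not part of the original code or the real feed.

```python
import xml.dom.minidom


def text_in_document_order(node):
    """Collect the non-empty text nodes under `node` in the order they appear."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text = child.data.strip()
            if text:
                parts.append(text)
        elif child.nodeType == child.ELEMENT_NODE:
            # Recurse so that <b>Location:</b> comes before the text after it.
            parts.extend(text_in_document_order(child))
    return parts


# Hypothetical, well-formed excerpt modelled on the feed's <description>.
sample = (
    "<description>"
    "<div><b>Location:</b> San Francisco, CA</div><br />"
    "<div><b>Agency:</b> San Francisco Bay Area AQMD </div>"
    "</description>"
)

dom = xml.dom.minidom.parseString(sample)
desc = dom.getElementsByTagName("description")[0]
for div in desc.getElementsByTagName("div"):
    print(" ".join(text_in_document_order(div)))
```

With nested <div> elements, as in the full feed, the inner text would be reported once for each enclosing <div>, so you would probably want to call the helper only on the elements you care about.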
