How do I get some values from XML in Python?

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>https://www.nsnam.org/wiki/Main_Page</loc>
        <lastmod>2018-10-24T03:03:05+00:00</lastmod>
        <priority>1.00</priority>
    </url>
    <url>
        <loc>https://www.nsnam.org/wiki/Current_Development</loc>
        <lastmod>2018-10-24T03:03:05+00:00</lastmod>
        <priority>0.80</priority>
    </url>
    <url>
        <loc>https://www.nsnam.org/wiki/Developer_FAQ</loc>
        <lastmod>2018-10-24T03:03:05+00:00</lastmod>
        <priority>0.80</priority>
    </url>

2 Answers

User

#1 · Edited 2024-05-29 03:10:46

I suggest using the ElementTree package from the standard library:

from xml.etree import ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->
    ...
    ...
</urlset>"""

urlset = ET.fromstring(SITEMAP)
loc_elements = urlset.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
for loc_element in loc_elements:
    print(loc_element.text)

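Documentation link:

Update:

What goes wrong in your code is the handling of XML namespaces: every element in the sitemap belongs to the namespace http://www.sitemaps.org/schemas/sitemap/0.9, so a bare tag name such as loc matches nothing.
Also, my example uses .iter() rather than .findall()/.find() to reach the loc elements directly. .iter() searches the whole tree at any depth, while .findall() with a plain tag name only looks at direct children. Whether that is what you want depends on the structure of the XML and on the use case.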
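
To make the namespace point and the .iter() versus .findall() difference concrete, here is a minimal sketch. It assumes a shortened copy of the sitemap from the question; the "sm" prefix and the variable names are arbitrary choices of mine, not something the sitemap format prescribes.

```python
from xml.etree import ElementTree as ET

# Namespace URI copied from the question; the "sm" prefix is an arbitrary label.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened copy of the question's sitemap (two entries only).
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.nsnam.org/wiki/Main_Page</loc>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>https://www.nsnam.org/wiki/Current_Development</loc>
    <priority>0.80</priority>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# A bare tag name matches nothing, because the real tag names include the URI.
unqualified = root.findall("url")

# findall() with a prefix map: direct <url> children, then the <loc> of each.
locs_via_findall = [
    url.findtext("sm:loc", namespaces=NS) for url in root.findall("sm:url", NS)
]

# iter() with the fully qualified name: every <loc> at any depth.
locs_via_iter = [el.text for el in root.iter("{%s}loc" % NS["sm"])]

print(unqualified)
print(locs_via_findall)
print(locs_via_iter)
```

For a flat sitemap like this one both approaches return the same URLs. They would differ if loc elements could also appear somewhere other than directly inside a url element.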

User

#2 · Edited 2024-05-29 03:10:46

Your code runs fine on my side. You only need to put {http://www.sitemaps.org/schemas/sitemap/0.9} in front of url and loc.

Like this:

import os.path
import xml.etree.ElementTree

import requests


def creatingListOfBrokenLinks():
    if os.path.isfile('sitemap.xml'):
        e = xml.etree.ElementTree.parse('sitemap.xml').getroot()
        # The with-block closes the output file even if a request raises.
        with open("all_broken_links.txt", "w") as file:
            # Tag names must be qualified with the sitemap namespace URI.
            for atype in e.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
                link = atype.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
                r = requests.get(link)
                print(link)
                if r.status_code == 404:
                    # Write the URL string; write() does not accept an Element.
                    file.write(link + "\n")


if __name__ == "__main__":
    creatingListOfBrokenLinks()
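
If you want to try the parsing part without hitting the network, one option is to split the function in two. This is only a sketch: extract_locs, find_broken and the fake_status table are hypothetical names and made-up status codes for illustration, and the status lookup is passed in as a function so that a stand-in can replace requests.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text):
    """Return the text of the <loc> inside each <url> child of the root."""
    root = ET.fromstring(xml_text)
    locs = []
    for url in root.findall(SITEMAP_NS + "url"):
        loc = url.find(SITEMAP_NS + "loc")
        # Skip entries with a missing or empty <loc> instead of failing on None.
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return locs


def find_broken(urls, get_status):
    """Return the URLs for which get_status(url) equals 404."""
    return [u for u in urls if get_status(u) == 404]


# Offline demonstration: the status codes below are invented for the example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.nsnam.org/wiki/Main_Page</loc></url>
  <url><loc>https://www.nsnam.org/wiki/Developer_FAQ</loc></url>
  <url></url>
</urlset>"""

fake_status = {
    "https://www.nsnam.org/wiki/Main_Page": 200,
    "https://www.nsnam.org/wiki/Developer_FAQ": 404,
}

broken = find_broken(extract_locs(SAMPLE), fake_status.get)
print(broken)
```

For a real run you could pass something like lambda u: requests.get(u).status_code as get_status and read the XML text from sitemap.xml.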
