Scrapy: crawl all sitemap links

4 votes

2 answers

13299 views

Asked 2025-04-18 01:59

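I want to crawl all the links listed in the sitemap.xml of a fixed site. I came across Scrapy's SitemapSpider. So far I have extracted all the URLs in the sitemap. Now I want to visit each of those URLs in turn. Any help would be much appreciated. My code so far: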

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)

2 Answers

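You need to add sitemap_rules to process the data from the crawled URLs, and you can create as many rules as you want. For example, say you have a page at http://www.xyz.nl//x/ and you want to create a rule for it: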

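from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # List of (regex, callback) tuples; this example contains one rule.
    # The callback is given by name because parse_x is defined further down.
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        # Extract every <p> element from the matched page.
        paragraphs = response.xpath('//p').getall()
        # A callback must yield items (here a dict) or requests, not bare strings.
        yield {'url': response.url, 'paragraphs': paragraphs}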
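
To make the "as many rules as you want" part concrete, here is a small standalone sketch of how rule order decides which callback a sitemap URL ends up with. It is a simplified model, not Scrapy's own code: each pattern is searched as a regular expression against the URL and the first matching rule wins. The `/products/` pattern, the `parse_product` callback name and the `pick_callback` helper are hypothetical and only serve the illustration.

```python
import re

# Hypothetical rules as (URL pattern, callback name), checked top to bottom.
SITEMAP_RULES = [
    ("/x/", "parse_x"),
    ("/products/", "parse_product"),  # assumed extra section, not in the question
    ("", "parse"),                    # empty pattern matches every URL (catch-all)
]

def pick_callback(url, rules=SITEMAP_RULES):
    """Return the callback name of the first rule whose pattern is found in url."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched, so this URL would not be crawled

if __name__ == "__main__":
    for url in (
        "http://www.xyz.nl/x/page1",
        "http://www.xyz.nl/products/42",
        "http://www.xyz.nl/about",
    ):
        print(url, "->", pick_callback(url))
```

Because the first match wins, specific patterns should come before a catch-all rule, otherwise the catch-all takes every URL.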

Answered 2025-04-18 by Python大师


In short, you can create new Request objects for the URLs that SitemapSpider produces and handle the responses with a new callback.

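from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)
        # This URL has already been fetched once, so dont_filter=True is needed
        # to keep the duplicate filter from dropping the new request.
        return Request(response.url, callback=self.parse_sitemap_url, dont_filter=True)

    def parse_sitemap_url(self, response):
        # Do stuff with your sitemap links; as a placeholder, yield the URL.
        yield {"url": response.url}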
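
One caveat with this approach: a request for response.url repeats a URL the spider has already fetched, and Scrapy's scheduler drops repeated requests unless they are created with dont_filter=True. The sketch below is a toy model of that behaviour, not Scrapy's implementation (Scrapy compares request fingerprints rather than raw URL strings); the `SeenFilter` class and its `should_schedule` method are made-up names for illustration.

```python
class SeenFilter:
    """Toy duplicate filter: remembers URLs and rejects repeats."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # Requests flagged dont_filter bypass the check entirely.
        if dont_filter:
            return True
        # A URL that was scheduled before is treated as a duplicate.
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

if __name__ == "__main__":
    f = SeenFilter()
    url = "http://www.xyz.nl/x/"
    print(f.should_schedule(url))                    # first visit, scheduled
    print(f.should_schedule(url))                    # repeat, dropped
    print(f.should_schedule(url, dont_filter=True))  # repeat, forced through
```

Since SitemapSpider already passes each page response to the callback, doing the work directly in that callback avoids the second download altogether.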

Answered 2025-04-18 by Python大师

