Scrapy: how can one response be parsed by multiple parse functions?

1 vote

1 answer

1179 views

Asked on 2025-04-18 17:36

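I want to do some special processing for every URL in start_urls, and then have the spider keep following all the next-page links and crawl deeper. So my code looks roughly like this: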

def init_parse(self, response):
    item = MyItem()

    # extract info from the landing url and populate item fields here...

    yield self.newly_parse(response)
    yield item
    return


parse_start_url = init_parse


def newly_parse(self, response):
    item = MyItem2()
    newly_soup = BeautifulSoup(response.body)

    # parse, return or yield items

    return item

This code does not work, because a spider callback may only return items, requests, or None, yet I am yielding the result of self.newly_parse. How can I achieve this in Scrapy?

My not-so-elegant solution:

Move the body of init_parse into newly_parse and check is_start_url at the top: if response.url is in start_urls, run the init_parse steps first.
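A minimal sketch of this first approach, kept free of Scrapy so it runs on its own. `SketchSpider`, the `SimpleNamespace` fake response, and the dict items are hypothetical stand-ins rather than Scrapy API; in a real spider, newly_parse would be a callback on the spider class.

```python
from types import SimpleNamespace


class SketchSpider:
    # Hypothetical stand-in for a spider; only start_urls mirrors the question.
    start_urls = ["http://example.com/"]

    def newly_parse(self, response):
        # Landing pages get the extra init_parse-style treatment first.
        if response.url in self.start_urls:
            yield {"kind": "landing", "url": response.url}
        # Every page, landing or not, then gets the regular parsing.
        yield {"kind": "page", "url": response.url}


spider = SketchSpider()
landing = list(spider.newly_parse(SimpleNamespace(url="http://example.com/")))
deeper = list(spider.newly_parse(SimpleNamespace(url="http://example.com/page/2")))
```

One caveat with this check: if a start URL is redirected, response.url will likely hold the final URL and may no longer match the entry in start_urls.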

Another not-so-great solution:

Factor the "# parse, return or yield items" code out into a class method or a generator, and call that method or generator from both init_parse and newly_parse.
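This second approach could be sketched as follows, again without Scrapy so that it runs standalone. The helper name `extract_items`, the `titles` attribute on the fake response, and the dict items are all assumptions made for illustration. `yield from` requires Python 3.3 or later.

```python
from types import SimpleNamespace


class SketchSpider:
    def extract_items(self, response):
        # Hypothetical shared helper holding the "parse, return or yield items" logic.
        for title in response.titles:
            yield {"kind": "page", "title": title}

    def init_parse(self, response):
        # Landing-page-only work first, then the shared logic.
        yield {"kind": "landing", "url": response.url}
        yield from self.extract_items(response)

    def newly_parse(self, response):
        # Follow-up pages only need the shared logic.
        yield from self.extract_items(response)


response = SimpleNamespace(url="http://example.com/", titles=["a", "b"])
spider = SketchSpider()
from_landing = list(spider.init_parse(response))
from_follow_up = list(spider.newly_parse(response))
```

Both callbacks stay flat generators, so whatever consumes them only ever sees individual items.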

Tags: code optimization, generators, request handling, class methods, scrapy, spiders, parse functions, deep crawling

1 Answer

If you plan to yield multiple items from newly_parse, the code in init_parse should be:

for item in self.newly_parse(response):
    yield item

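Once newly_parse yields its items, calling self.newly_parse(response) returns a generator object rather than the items themselves. You have to iterate over that generator yourself, because Scrapy will not unpack it for you.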
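The difference can be seen in plain Python, without Scrapy. In this sketch the functions are module-level stand-ins for the spider methods, and the dict items and the `None` response are placeholders chosen for illustration.

```python
import types


def newly_parse(response):
    # Generator function: calling it returns a generator object, not items.
    yield {"source": "newly_parse", "n": 1}
    yield {"source": "newly_parse", "n": 2}


def init_parse_broken(response):
    # Yields the generator object itself as one single result.
    yield newly_parse(response)
    yield {"source": "init_parse"}


def init_parse_fixed(response):
    # Iterates the inner generator so each item is yielded individually.
    for item in newly_parse(response):
        yield item
    yield {"source": "init_parse"}


broken = list(init_parse_broken(None))
fixed = list(init_parse_fixed(None))
```

Here `broken[0]` is a generator object, while `fixed` contains only the three dict items. On Python 3.3 or later, the loop in `init_parse_fixed` can be shortened to `yield from newly_parse(response)`.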

Answered on 2025-04-18 by Python大师
