CrawlSpider cannot follow links

    start_urls = ["http://home.mercadolivre.com.br/mais-categorias/"]

    rules = (
        # I would like this to force the spider to crawl through the pages...
        # calling the product parser each time
        Rule(
            LxmlLinkExtractor(
                allow=(),
                restrict_xpaths='//*[@id="results-section"]/div[2]/ul/li[@class="pagination__next"]',
            ),
            follow=True,
            callback='parse_product_links',
        ),
    )

    def parse(self, response):
        categories = CategoriesItem()
        #categories['categoryLinks'] = []
        for link in LxmlLinkExtractor(
            allow=('(?<=http://lista.mercadolivre.com.br/delicatessen/)(?:whisky|licor|tequila|vodka|champagnes)'),
            restrict_xpaths=("//body"),
        ).extract_links(response):
            categories['categoryURL'] = link.url
            yield Request(link.url, meta={'categoryURL': categories['categoryURL']}, callback=self.parse_product_links)

    # ideally this function would grab the product links from each page
    def parse_product_links(self, response):
        # I have this built out in my code, but it isnt necessary so I wanted to keep it as de-cluttered as possible

1 answer

User

#1 · Posted 2024-04-24 03:30:26

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

This is from the CrawlSpider documentation.
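To see why the quoted warning matters, here is a toy model in plain Python. It is not Scrapy's actual code: `ToyCrawlSpider`, `GoodSpider`, `BrokenSpider`, and the predicate-based rules are all hypothetical names made up for illustration. It only shows the mechanism the passage describes, namely that when a base class runs its rules inside `parse`, a subclass that overrides `parse` silently switches the rules off.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven base spider (NOT Scrapy's real code)."""

    # Each rule is a pair: (predicate on the URL, name of the callback method).
    rules = ()

    def parse(self, url):
        # The base class's own parse() is where the rules get dispatched.
        results = []
        for matches, callback_name in self.rules:
            if matches(url):
                results.append(getattr(self, callback_name)(url))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = ((lambda url: "page" in url, "parse_page"),)

    def parse_page(self, url):
        return f"rule handled {url}"


class BrokenSpider(GoodSpider):
    def parse(self, url):
        # Overriding parse() replaces the dispatch loop, so no rule ever runs.
        return ["custom parse only"]


print(GoodSpider().parse("http://example.com/page/2"))
# ['rule handled http://example.com/page/2']
print(BrokenSpider().parse("http://example.com/page/2"))
# ['custom parse only']
```

`BrokenSpider` inherits the same rules as `GoodSpider`, yet its callback is never invoked, which matches the symptom in the question: the pagination rule is defined but appears to do nothing.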

If you are not yet familiar with how Scrapy works, using CrawlSpider is unwise. It is a very implicit shortcut, and that implicitness can be confusing.

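In your case, you have overridden parse, which should not be done, and your only rule is the one for the next page. So remove the parse method and extend rules to contain two rules: one that finds products and one that finds pages (for the latter, set follow to True, because you want to keep finding new pages from within new pages).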
