Scrapy CrawlSpider page limit not working

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import logging
import os


class FollowAllSpider(CrawlSpider):
    custom_settings = {"CLOSESPIDER_ITEMCOUNT": 1, "CONCURRENT_REQUEST": 1}

    name = 'follow_all'
    allowed_domains = ['testdomain.com']
    start_urls = ['https://www.testdomain.com/simple-website/']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        dirname = os.path.dirname(__file__)
        filename = response.url.split("/")[-1] + '.html'
        filePath = os.path.join(dirname, "pages/", filename)
        with open(filePath, 'wb') as f:
            f.write(response.body)
        return

1 Answer

User

#1 · Posted 2024-04-25 22:58:56

If you want to limit the number of pages crawled, use CLOSESPIDER_PAGECOUNT instead of CLOSESPIDER_ITEMCOUNT.

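It is also worth noting that your spider does not yield any items. Even if you wanted to use CLOSESPIDER_ITEMCOUNT, there would be nothing for it to count, because you write directly to files instead.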
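If you do want item counting to work, the callback has to yield something. Below is a minimal sketch of such a callback, written as a plain function so it can be tried without running a crawl. The item fields (url, size) are hypothetical, and the SimpleNamespace object is only a stand-in for a real Scrapy response.

```python
from types import SimpleNamespace


def parse_item(self, response):
    """Callback sketch: yield one item per page so there is something to count."""
    # A plain dict is accepted by Scrapy as an item; the field names are made up.
    yield {"url": response.url, "size": len(response.body)}


# Quick check with a stand-in response object (not a real scrapy Response).
fake_response = SimpleNamespace(
    url="https://www.testdomain.com/simple-website/",
    body=b"<html></html>",
)
items = list(parse_item(None, fake_response))
```

Inside the spider this would be a method of the class, and the file-writing code could stay in it before the yield.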

You can read more about CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ITEMCOUNT in the Scrapy documentation.

One last thing: when using CLOSESPIDER_PAGECOUNT, be aware of the following caveat, because your results may not match your expectations: https://stackoverflow.com/a/34535390/11326319
