Scrapy SitemapSpider does not resume after a clean shutdown

Posted 2024-05-16 13:28:54


I am running a Scrapy sitemap spider with JOBDIR, following this documentation: https://docs.scrapy.org/en/latest/topics/jobs.html

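However, when I perform a clean shutdown with Ctrl+C (I have checked many times that it is a clean shutdown; I am not sending Ctrl+C twice), the spider fails to resume. The output shows "Filtered duplicate request: example.com/sitemap.xml", yet the JOBDIR folder still contains a large number of unseen requests and a very large requests.queue file.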

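Why does Scrapy filter out the starting sitemap instead of using requests.queue? Obviously, if it filters out the only URL I give it, the sitemap, it will never get anywhere. Any ideas?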
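One way to see what the job directory actually holds before resuming is a small standard-library script. This is only a minimal sketch, assuming the layout Scrapy normally writes under JOBDIR: a requests.seen text file with one request fingerprint per line, and a requests.queue directory holding the pending requests. The function name summarize_jobdir and the example path are hypothetical.

```python
from pathlib import Path


def summarize_jobdir(jobdir):
    """Count recorded fingerprints and queued bytes in a Scrapy JOBDIR."""
    root = Path(jobdir)
    seen_file = root / "requests.seen"    # assumed: one fingerprint per line
    queue_dir = root / "requests.queue"   # assumed: directory of queue files

    seen_count = 0
    if seen_file.is_file():
        with seen_file.open() as fh:
            seen_count = sum(1 for line in fh if line.strip())

    queue_bytes = 0
    if queue_dir.is_dir():
        queue_bytes = sum(
            p.stat().st_size for p in queue_dir.rglob("*") if p.is_file()
        )

    return {"seen_fingerprints": seen_count, "queue_bytes": queue_bytes}


if __name__ == "__main__":
    # Hypothetical path; substitute the JOBDIR passed to the crawl.
    print(summarize_jobdir(".jobs/alexa_site_1"))
```

A non-zero fingerprint count alongside a non-empty queue would at least be consistent with the symptom described: the sitemap URL was recorded as seen during the first run, while pending requests remain on disk.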

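EDIT: One thing I am doing is launching the spider from a Python script rather than from the scrapy command line. By doing so, maybe I am not passing the -s argument that Scrapy recommends. What does "-s" do? Where is this option documented?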

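    from scrapy.crawler import CrawlerProcess

    # JOBDIR is where Scrapy persists the pending queue and the seen requests
    process = CrawlerProcess({'JOBDIR': '.jobs/alexa_site_' + str(alexa_site_id)})

    MySpider.set_settings(alexa_site)
    MySpider.custom_settings[
        'USER_AGENT'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"

    process.crawl(MySpider, alexa_site_id=alexa_site_id)
    # start() blocks until the crawl finishes or is interrupted
    process.start()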

Here is the spider code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    custom_settings = {
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'MEMUSAGE_ENABLED': True,
        'DOWNLOAD_TIMEOUT': 20,
        'DEPTH_LIMIT': 100000,
        'LOG_LEVEL': 'CRITICAL',
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        },
        'ROTATING_PROXY_BAN_POLICY': 'spiders.classes.proxies.policy.MyPolicy',
        'ROTATING_PROXY_PAGE_RETRY_TIMES': 20,
        'ROTATING_PROXY_LOGSTATS_INTERVAL': 5,
        'ROTATING_PROXY_CLOSE_SPIDER': True,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 403],
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
        'TELNETCONSOLE_USERNAME': 'scrapy',
        'TELNETCONSOLE_PASSWORD': 'scrapy',
        'DUPEFILTER_DEBUG': True
    }

    name = None
    allowed_domains = []
    sitemap_urls = ['https://www.example.com/sitemap.xml']

    def parse(self, response):
        le = LinkExtractor()
        links = le.extract_links(response)
        for link in links:
            yield response.follow(link.url, self.parse)


0 answers

No answers yet
