Running a Scrapy spider in the background of a Flask app

Question

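I am building an application that uses Flask and Scrapy. When someone visits the root URL of my app, it processes some data and displays it. In addition, I want to (re)start my spider if it is not already running. Because the spider takes about 1.5 hours to run, I use threading to run it as a background process. Here is a minimal example (you will also need testspiders):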

import os
from flask import Flask, render_template
import threading
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings    
from testspiders.spiders.followall import FollowAllSpider

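def crawl():
    # Build the spider and a crawler with default settings.
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    # Ask the Twisted reactor to stop once the spider closes.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    # Blocks this thread until reactor.stop() is called.
    reactor.run()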

app = Flask(__name__)

@app.route('/')
def main():
    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]

    if 'crawler' not in thread_names:
        run_in_bg.start()

    return 'hello world'

if __name__ == "__main__":
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)

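As a side note, the lines below are an ad hoc approach I came up with to check whether my spider thread is still running. If there is a more idiomatic way, I would appreciate some guidance.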

run_in_bg = threading.Thread(target=crawl, name='crawler')
thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]

if 'crawler' not in thread_names:
    run_in_bg.start()

Now for the problem: if I save the script above as crawler.py, run python crawler.py, and visit localhost:5000, I get the following error (ignore Scrapy's HtmlXPathSelector deprecation warnings):

exceptions.ValueError: signal only works in main thread

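Although the spider runs, it never stops, because the signals.spider_closed signal only works in the main thread (according to this error). As expected, subsequent requests to the root URL produce a flood of errors.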
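
The wording of this error matches the one Python's standard signal module raises when signal.signal() is called outside the main thread, which is unrelated to Scrapy's own signals API. By default, Twisted's reactor.run() installs OS signal handlers, so that is a plausible trigger here, though this is an assumption rather than something confirmed against this exact setup. The sketch below reproduces only the standard-library behavior, with no Scrapy or Twisted involved; install_handler is a hypothetical helper.

```python
import signal
import threading


def install_handler(result):
    # signal.signal() may only be called from the main thread;
    # from any other thread it raises ValueError.
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
        result.append('installed')
    except ValueError as exc:
        result.append('ValueError: %s' % exc)


result = []
worker = threading.Thread(target=install_handler, args=(result,))
worker.start()
worker.join()
print(result[0])
```

If that reading is right, passing installSignalHandlers=False to reactor.run() might avoid the error, at the cost of the reactor no longer reacting to Ctrl-C itself.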

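How can I design my app so that it starts my spider if it is not already crawling, while immediately returning control to my app (that is, I do not want to wait for the crawler to finish) so it can do other things?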

Tags: data processing, signal handling, threading, background processes, flask, scrapy, crawler, application design


1 Answer
