What is the best way to run multiple Scrapy spiders with multiprocessing?

Posted 2024-04-23 06:56:03


I am currently using Scrapy together with multiprocessing. I built a proof of concept to run many spiders. My code looks like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process

def worker(work_queue, done_queue):
    try:
        # Consume commands from the queue until the 'STOP' sentinel is reached
        for action in iter(work_queue.get, 'STOP'):
            status_code = run_spider(action)
    except Exception as e:
        done_queue.put("%s failed on %s with: %s" % (current_process().name, action, e))
    return True


def run_spider(action):
    # Run the scrapy command in a subshell and return its exit status
    return os.system(action)

def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )

    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []

    for action in sites:
        work_queue.put(action)

    # Start the worker processes; one 'STOP' sentinel per worker so each exits cleanly
    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')

    for p in processes:
        p.join()

    done_queue.put('STOP')

    # Print any failure messages reported by the workers
    for status in iter(done_queue.get, 'STOP'):
        print(status)

if __name__ == '__main__':
    main()

What do you think is the best solution for running multiple Scrapy instances?

Is it better to launch one Scrapy instance per URL, or to launch one spider with x URLs (e.g. 1 spider with 100 links)?


1 Answer

Answered 2024-04-23 06:56:03

It would be better to launch a Scrapy instance for each URL or launch a spider with x URL (ex: 1 spider with 100 links) ?

Launching a Scrapy instance per URL is definitely a bad choice, because for every single URL you would pay the startup overhead of Scrapy itself.

I think it is better to distribute the URLs evenly across the spiders, as in the sketch below.
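For illustration, here is a minimal sketch of that idea: the URL list is split into one chunk per worker, and each chunk is handed to a single scrapy crawl invocation. It assumes the level1 spider has been adapted to accept a comma-separated urls argument (e.g. parsed into start_urls in its __init__); the urls list and the helper names are illustrative, not part of the original code.

import os
from multiprocessing import Pool

# Illustrative list of URLs; in the question, each URL was a separate scrapy command.
urls = ["https://www.example.com/test%d.html" % i for i in range(100)]


def chunked(seq, n):
    # Split seq into n roughly equal chunks.
    k, m = divmod(len(seq), n)
    return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


def run_chunk(url_chunk):
    # Assumption: the 'level1' spider accepts a comma-separated 'urls' argument
    # and turns it into start_urls in its __init__.
    command = "scrapy crawl level1 -a urls='%s'" % ",".join(url_chunk)
    return os.system(command)


if __name__ == '__main__':
    workers = 2
    with Pool(workers) as pool:
        exit_codes = pool.map(run_chunk, chunked(urls, workers))
    print(exit_codes)

With 2 workers and 100 URLs this launches only two Scrapy processes instead of one hundred, so the per-process startup cost is paid only once per worker rather than once per URL.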
