I am currently using Scrapy with multiprocessing. I made a POC in order to run many spiders. My code looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process

def worker(work_queue, done_queue):
    try:
        # Consume crawl commands until the 'STOP' sentinel arrives
        for action in iter(work_queue.get, 'STOP'):
            status_code = run_spider(action)
            done_queue.put("%s finished %s with exit code %s"
                           % (current_process().name, action, status_code))
    except Exception as e:
        done_queue.put("%s failed on %s with: %s"
                       % (current_process().name, action, e))
    return True

def run_spider(action):
    # Run the crawl command in a subshell and return its exit status
    return os.system(action)

def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )
    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []
    for action in sites:
        work_queue.put(action)
    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')  # one sentinel per worker
    for p in processes:
        p.join()
    done_queue.put('STOP')
    for status in iter(done_queue.get, 'STOP'):
        print(status)

if __name__ == '__main__':
    main()
What do you think is the best solution for running multiple Scrapy instances? Is it better to start one Scrapy instance per URL, or to start one spider with x URLs (for example: 1 spider with 100 links)?
Starting one Scrapy instance per URL is definitely a bad choice, because for every single URL you would pay the startup overhead of Scrapy itself.
I think it is best to distribute the URLs evenly across the spiders.
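For example, a minimal sketch of that batching on the driver side, assuming the spider is changed to accept a comma-separated urls argument (the chunk helper and the urls parameter are illustrative, not part of the original code):

    # A sketch of even URL distribution, assuming a hypothetical
    # comma-separated `urls` spider argument; not the original code.
    workers = 2
    all_urls = ['https://www.example.com/test%d.html' % i for i in range(100)]

    def chunk(seq, n):
        # Split seq into n batches of near-equal size
        k, m = divmod(len(seq), n)
        return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
                for i in range(n)]

    # One crawl command per batch instead of one per URL, so Scrapy's
    # startup cost is paid `workers` times rather than len(all_urls) times
    sites = tuple(
        "scrapy crawl level1 -a urls='%s'" % ','.join(batch)
        for batch in chunk(all_urls, workers)
    )

With this in place, the sites tuple in main() shrinks to one command per worker, and each Scrapy process amortizes its startup cost over a whole batch of URLs.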
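On the spider side, a matching sketch: Scrapy exposes -a arguments as string attributes on the spider, so a hypothetical level1 spider could split the list back into individual requests (the attribute name urls and the parse body are assumptions for illustration):

    import scrapy

    class Level1Spider(scrapy.Spider):
        name = 'level1'

        def start_requests(self):
            # `-a urls=...` arrives as the string attribute self.urls;
            # split it back into one Request per URL (name is assumed)
            for url in getattr(self, 'urls', '').split(','):
                if url:
                    yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Placeholder extraction: record the URL and HTTP status
            yield {'url': response.url, 'status': response.status}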