I am currently using Scrapy with multiprocessing. I made a POC in order to run many spiders. My code looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process

def worker(work_queue, done_queue):
    try:
        # Consume crawl commands until the 'STOP' sentinel arrives
        for action in iter(work_queue.get, 'STOP'):
            status_code = run_spider(action)
            done_queue.put("%s finished %s with exit code %s"
                           % (current_process().name, action, status_code))
    except Exception as e:
        done_queue.put("%s failed on %s with: %s"
                       % (current_process().name, action, e))
    return True

def run_spider(action):
    # Run the crawl command in a subshell and return its exit status
    return os.system(action)

def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )
    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []
    for action in sites:
        work_queue.put(action)
    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')  # one sentinel per worker
    for p in processes:
        p.join()
    done_queue.put('STOP')
    for status in iter(done_queue.get, 'STOP'):
        print(status)

if __name__ == '__main__':
    main()
What do you think is the best solution for running multiple Scrapy instances? Is it better to start one Scrapy instance per URL, or to start one spider with x URLs (for example: 1 spider with 100 links)?
Starting one Scrapy instance per URL is definitely a bad choice, because for every single URL you would pay the startup overhead of Scrapy itself.
I think it is best to distribute the URLs evenly across the spiders.
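For example, a minimal sketch of that batching on the driver side, assuming the spider is changed to accept a comma-separated urls argument (the chunk helper and the urls parameter are illustrative, not part of the original code):

    # A sketch of even URL distribution, assuming a hypothetical
    # comma-separated `urls` spider argument; not the original code.
    workers = 2
    all_urls = ['https://www.example.com/test%d.html' % i for i in range(100)]

    def chunk(seq, n):
        # Split seq into n batches of near-equal size
        k, m = divmod(len(seq), n)
        return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
                for i in range(n)]

    # One crawl command per batch instead of one per URL, so Scrapy's
    # startup cost is paid `workers` times rather than len(all_urls) times
    sites = tuple(
        "scrapy crawl level1 -a urls='%s'" % ','.join(batch)
        for batch in chunk(all_urls, workers)
    )

With this in place, the sites tuple in main() shrinks to one command per worker, and each Scrapy process amortizes its startup cost over a whole batch of URLs.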
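On the spider side, a matching sketch: Scrapy exposes -a arguments as string attributes on the spider, so a hypothetical level1 spider could split the list back into individual requests (the attribute name urls and the parse body are assumptions for illustration):

    import scrapy

    class Level1Spider(scrapy.Spider):
        name = 'level1'

        def start_requests(self):
            # `-a urls=...` arrives as the string attribute self.urls;
            # split it back into one Request per URL (name is assumed)
            for url in getattr(self, 'urls', '').split(','):
                if url:
                    yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # Placeholder extraction: record the URL and HTTP status
            yield {'url': response.url, 'status': response.status}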