Passing an array of dictionaries to multiple processes through shared memory

Question

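The code at the end of this question works, but it is slow because it has to handle a large amount of data. In real use, creating the processes and sending them the data takes almost as long as the computation itself, so by the time the second process has been created the first one has nearly finished its calculation, and the parallelism gains nothing.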

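The code is the same as in the question "Multiprocessing has a cutoff at 992 integers when joining results", with the changes suggested there already applied. However, I have run into what I assume is a common problem: with large data, pickling (that is, serializing the data into a form that can be stored or transmitted) takes a long time.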
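To check whether serialization really is the bottleneck, the cost can be measured directly. The sketch below is written for Python 3 and is only an illustration: `make_records` and `pickle_cost` are hypothetical helper names, and the record shape (4000 dictionaries of 200 integer values) is assumed from the description in this question. It times one `pickle` round trip, which is roughly the work done on data that is sent through a `multiprocessing.Queue`.

```python
import pickle
import time


def make_records(n_records=4000, n_keys=200):
    """Build a list of independent dicts shaped like the data described here."""
    return [{str(k): k for k in range(n_keys)} for _ in range(n_records)]


def pickle_cost(obj):
    """Return (size in bytes, elapsed seconds) for one pickle round trip."""
    t0 = time.perf_counter()
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(blob)
    return len(blob), time.perf_counter() - t0


if __name__ == "__main__":
    records = make_records()
    size, seconds = pickle_cost(records)
    print(f"{len(records)} dicts: {size} bytes, {seconds:.3f} s per round trip")
```

Test data for such a measurement should consist of distinct dictionaries. A list that holds the same dictionary object many times pickles far smaller, because `pickle` memoizes repeated references, and that would understate the real cost.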

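I have seen answers that use multiprocessing.Array to pass shared-memory arrays. My array has roughly 4000 indexes, but each index holds a dictionary with 200 key/value pairs. Each process only reads this data, does some calculation, and returns a matrix (4000x3) that contains no dictionaries.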
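One way to fit this kind of data into a `multiprocessing.Array` is to flatten it. The Python 3 sketch below assumes that every dictionary has the same keys and that all values are numeric; neither is stated in the question. The names `flatten_records`, `lookup`, `row_sums`, and `worker` are hypothetical, and the row sum only stands in for the real calculation. Because the data is read-only, the array is created with `lock=False`.

```python
import multiprocessing


def flatten_records(records, keys):
    """Copy the dict values into one flat, row-major shared array of doubles."""
    values = [float(record[key]) for record in records for key in keys]
    return multiprocessing.Array("d", values, lock=False)


def lookup(flat, n_keys, row, col):
    """Read the value of column `col` (a key position) in record `row`."""
    return flat[row * n_keys + col]


def row_sums(flat, n_keys, start, end):
    """Placeholder computation: sum the values of each record in [start, end)."""
    return [sum(flat[r * n_keys:(r + 1) * n_keys]) for r in range(start, end)]


def worker(flat, n_keys, start, end, result_q):
    # The shared array is handed to the child at start-up, not pickled per item
    result_q.put((start, row_sums(flat, n_keys, start, end)))


if __name__ == "__main__":
    keys = [str(k) for k in range(200)]
    records = [{key: i for key in keys} for i in range(4000)]
    flat = flatten_records(records, keys)

    result_q = multiprocessing.Queue()
    bounds = [(0, 2000), (2000, 4000)]
    procs = [multiprocessing.Process(target=worker,
                                     args=(flat, len(keys), lo, hi, result_q))
             for lo, hi in bounds]
    for p in procs:
        p.start()
    results = [result_q.get() for _ in procs]  # drain the queue before joining
    for p in procs:
        p.join()
    for start, sums in sorted(results):
        print(start, len(sums))
```

The trade-off is that the key names are no longer stored with the data: each worker needs the key order (here implied by the column position) to find a value.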

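Answers such as "Is shared read-only data copied to different processes?" use map. Is it possible to keep the Process-based system from my code and still get shared memory? Is there an efficient way to send the array of dictionaries to each process, for example by putting the dictionaries in some kind of manager and then placing that in a multiprocessing.Array?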
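For comparison, the map-based approach mentioned above can be sketched as follows in Python 3. This is an illustration under assumptions, not the code from the linked answer: `init_worker`, `compute_row`, and `run` are hypothetical names, and the min/max/sum row is a placeholder for the real calculation. The read-only list is handed to each worker once through the `initializer` argument of `multiprocessing.Pool`, so each task only carries an integer index.

```python
import multiprocessing

_RECORDS = None  # set once in every worker process by init_worker


def init_worker(records):
    """Store the read-only data in a module-level global of the worker."""
    global _RECORDS
    _RECORDS = records


def compute_row(index):
    """Return a 3-element result row for one record (placeholder computation)."""
    values = _RECORDS[index].values()
    return [min(values), max(values), sum(values)]


def run(records, nprocs=3):
    """Map compute_row over all record indexes and return an N x 3 matrix."""
    with multiprocessing.Pool(nprocs, initializer=init_worker,
                              initargs=(records,)) as pool:
        return pool.map(compute_row, range(len(records)), chunksize=256)


if __name__ == "__main__":
    records = [{str(k): k + i for k in range(200)} for i in range(4000)]
    matrix = run(records)
    print(len(matrix), len(matrix[0]))
```

With the fork start method the workers inherit `records` without pickling it; with spawn it is pickled once per worker rather than once per task.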

import multiprocessing

def main():
    # Build 3000 dictionaries, each holding 200 key/value pairs
    total = []
    for j in range(3000):
        data = {}
        for i in range(200):
            data[str(i)] = i
        total.append(data)

    CalcManager(total, start=0, end=3000)

def CalcManager(myData, start, end):
    print('in calc manager')
    # Multiprocessing
    # Set the number of processes to use
    nprocs = 3
    # Initialize the multiprocessing queue so we can get the values returned to us
    result_q = multiprocessing.Queue()
    # Set up an empty list to store our processes
    procs = []
    # Divide up the data for the set number of processes (rounded up so nothing is left over)
    interval = (end - start + nprocs - 1) // nprocs
    new_start = start
    # Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        print('starting processes')
        new_end = new_start + interval
        # Make sure we don't go past the size of the data
        if new_end > end:
            new_end = end
        # Slice out this process's share of the data
        data = myData[new_start:new_end]
        # Create the process and pass it the data and the result queue
        p = multiprocessing.Process(target=multiProcess, args=(data, new_start, new_end, result_q, i))
        procs.append(p)
        p.start()
        # The slice end is exclusive, so the next chunk starts at the current end
        new_start = new_end
    print('finished starting')

    # Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print(result)

    # Join the processes to wait for all of them to finish
    for p in procs:
        p.join()

# MultiProcess handling
def multiProcess(data, start, end, result_q, proc_num):
    print('started process')
    # Placeholder computation: build a 22x3 result matrix
    results = []
    for i in range(22):
        temp = []
        for j in range(3):
            temp.append(j)
        results.append(temp)
    result_q.put(results)
    return

if __name__ == '__main__':
    main()

Solution

Putting the list of dictionaries into a manager solved the problem.

manager=Manager()
d=manager.list(myData)

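The manager appears to hold not only the list but also the dictionaries inside it. Startup is somewhat slow, so the data is evidently still being copied, but that happens only once at the start, after which each process can read the data it needs from the managed list.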
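The behaviour described above can be illustrated with a small Python 3 sketch. It is not the code from this answer: `split_ranges` and `worker` are hypothetical names, and the worker only reports how many records it fetched. The list returned by `manager.list` is a proxy, so slicing it inside a worker sends one request to the manager process and returns an ordinary list containing copies of the dictionaries.

```python
from multiprocessing import Manager, Process, Queue


def split_ranges(start, end, nprocs):
    """Split [start, end) into nprocs contiguous half-open ranges."""
    size, extra = divmod(end - start, nprocs)
    ranges = []
    lo = start
    for i in range(nprocs):
        hi = lo + size + (1 if i < extra else 0)  # spread the remainder
        ranges.append((lo, hi))
        lo = hi
    return ranges


def worker(shared, start, end, result_q):
    chunk = shared[start:end]  # one round trip; returns a plain list copy
    result_q.put((start, len(chunk)))


if __name__ == "__main__":
    records = [{str(k): k for k in range(100)} for _ in range(3000)]
    with Manager() as manager:
        shared = manager.list(records)  # data is copied to the manager once
        result_q = Queue()
        procs = [Process(target=worker, args=(shared, lo, hi, result_q))
                 for lo, hi in split_ranges(0, 500, 3)]
        for p in procs:
            p.start()
        for _ in procs:
            print(result_q.get())
        for p in procs:
            p.join()

        # Elements come back as copies: changing one locally does not
        # change the dictionary held by the manager.
        first = shared[0]
        first["0"] = -1
        print(shared[0]["0"])
```

Since the workers here only read the data, working on copies is harmless. Fetching a whole slice at once, as `worker` does, keeps the number of round trips to the manager low compared with indexing element by element.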

import multiprocessing
from multiprocessing import Manager

def main():
    # Build 3000 dictionaries, each holding 100 key/value pairs
    total = []
    for j in range(3000):
        data = {}
        for i in range(100):
            data[str(i)] = i
        total.append(data)

    CalcManager(total, start=0, end=500)

def CalcManager(myData, start, end):
    print('in calc manager')
    print(type(myData[0]))

    # Copy the list of dictionaries into the manager once
    manager = Manager()
    d = manager.list(myData)

    # Multiprocessing
    # Set the number of processes to use
    nprocs = 3
    # Initialize the multiprocessing queue so we can get the values returned to us
    result_q = multiprocessing.Queue()
    # Set up an empty list to store our processes
    procs = []
    # Divide up the data for the set number of processes (rounded up so nothing is left over)
    interval = (end - start + nprocs - 1) // nprocs
    new_start = start
    # Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        new_end = new_start + interval
        # Make sure we don't go past the size of the data
        if new_end > end:
            new_end = end
        # Create the process and pass it the managed list (not a slice) and the result queue
        p = multiprocessing.Process(target=multiProcess, args=(d, new_start, new_end, result_q, i))
        procs.append(p)
        p.start()
        # The slice end is exclusive, so the next chunk starts at the current end
        new_start = new_end
    print('finished starting')

    # Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print(len(result))

    # Join the processes to wait for all of them to finish
    for p in procs:
        p.join()

# MultiProcess handling
def multiProcess(data, start, end, result_q, proc_num):
    # print('started process')
    # Fetch only this process's share from the managed list
    data = data[start:end]
    # Placeholder computation: build a 22x3 result matrix
    results = []
    for i in range(22):
        temp = []
        for j in range(3):
            temp.append(j)
        results.append(temp)
    print(len(data))
    result_q.put(results)
    return

if __name__ == '__main__':
    main()
