PyCUDA: the best way to call a kernel multiple times

2 answers

User
Answer 1 · edited 2024-04-19 08:22:16

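Your main question seems too general, and it is hard to give specific advice without seeing the code. I will try to answer your sub-questions (this is not a real answer, but it is too long for a comment).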
Do I need to create a new context on each kernel call?
No.
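Is there a way to not have to transfer memory from GPU to CPU, that is, starting a kernel, pausing it to get some information, restarting it, and repeating?
That depends on what you mean by "getting some information". If it means doing something with the data on the CPU, then yes, of course you have to transfer it. If the data is only going to be used by another kernel call, you do not need to transfer it; it can stay in GPU memory between launches.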
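As a minimal sketch of the second case (not the asker's code, which was not posted): the kernel `step`, the helper names `run_on_gpu` and `cpu_reference`, and the halving update are all made up for illustration. The data is uploaded once, two device buffers are swapped between launches, and the result is downloaded once at the end. `pycuda.autoinit` creates a single context that every launch reuses.

```python
import numpy as np

# Hypothetical kernel: each "step" halves every element.
KERNEL_SRC = """
__global__ void step(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 0.5f * src[i];
}
"""


def cpu_reference(host_data, n_steps):
    """Expected result of n_steps halvings, computed on the CPU for checking."""
    return np.asarray(host_data, dtype=np.float32) * np.float32(0.5) ** n_steps


def run_on_gpu(host_data, n_steps, block=256):
    """Apply the kernel n_steps times without any host/device copy in between."""
    # Local imports so this module can be imported on a machine without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates one context for the process)
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    step = SourceModule(KERNEL_SRC).get_function("step")

    data = np.ascontiguousarray(host_data, dtype=np.float32)
    n = np.int32(data.size)
    src = gpuarray.to_gpu(data)        # single host-to-device copy
    dst = gpuarray.empty_like(src)     # scratch buffer, lives on the device
    grid = ((data.size + block - 1) // block, 1)

    for _ in range(n_steps):
        step(src.gpudata, dst.gpudata, n, block=(block, 1, 1), grid=grid)
        src, dst = dst, src            # swap handles only, no data is moved

    return src.get()                   # single device-to-host copy
```

On a machine with a CUDA device, `run_on_gpu(np.ones(1000), 3)` should agree with `cpu_reference(np.ones(1000), 3)`.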
Each RK4 iteration takes roughly half a second, which is insane (also compared with the CUDA code in the link that does some similar operation).
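It really depends on the equation, the number of threads, and the video card you are using. I can imagine a situation where a single RK step takes that long.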
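To see where the half second actually goes, it can help to time the kernel alone with CUDA events instead of wall-clock time around the Python call. A rough sketch, where `time_launch_ms` and the `launch` callable are hypothetical names (the asker's kernel is not shown) and a CUDA context is assumed to exist already:

```python
def time_launch_ms(launch, repeats=10):
    """Average GPU time in milliseconds of one call to launch().

    launch is any zero-argument callable that enqueues GPU work,
    for example a lambda wrapping a PyCUDA kernel call.
    """
    # Local import: needs PyCUDA and an active context (e.g. pycuda.autoinit).
    import pycuda.driver as drv

    start, end = drv.Event(), drv.Event()

    launch()                 # warm-up, so one-time setup cost is not measured
    start.record()
    for _ in range(repeats):
        launch()
    end.record()
    end.synchronize()        # wait until the GPU has reached the end event

    return start.time_till(end) / repeats
```

If the time measured this way is far below half a second, the cost is more likely in host-side work or in memory transfers than in the kernel itself.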
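And I think this is due to something wrong with the way I'm using pycuda, so if you can explain the best way to do such an operation, it would be great!
Without code there is no way to tell. Try to create a minimal demonstration example, or at least post a link to a runnable (even if long) piece of code that illustrates your problem. As for PyCUDA, it is a very thin wrapper over CUDA, and all the programming practices that apply to the latter apply to the former as well.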
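One piece such a minimal example could include is a small CPU reference for the RK4 step, so that GPU results and timings have something to be compared against. A sketch under assumptions: the names `rk4_step` and `integrate` and the test equation dy/dt = -y are illustrative and are not taken from the question.

```python
import math

import numpy as np


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(f, y0, t0, h, n_steps):
    """Advance y0 by n_steps RK4 steps of size h, starting at time t0."""
    t, y = t0, np.asarray(y0, dtype=np.float64)
    for _ in range(n_steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


if __name__ == "__main__":
    # Illustrative test problem: dy/dt = -y, exact solution exp(-t).
    y_end = integrate(lambda t, y: -y, 1.0, 0.0, 0.01, 100)
    print(float(y_end), math.exp(-1.0))
```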

User
Answer 2 · edited 2024-04-19 08:22:16

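I may be able to help you with the memory side, that is, with not having to copy from CPU to GPU during the iterations. I am evolving a system step by step with the Euler time-stepping method, and the way I keep all the data on the GPU is shown below. The problem, however, is that once the first kernel is launched, the CPU goes on executing the lines that follow it, i.e. the boundary kernels are launched before the time-evolution step.
What I need is a way to synchronize things. I have tried using strm.synchronize() (see my code), but it does not always work. If you have any ideas about this, I would really appreciate your input! Thanks!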
import sys

import numpy as np
import pycuda.driver as drv

# Kern_diffIteration and the Kern_boundary* kernels are compiled elsewhere
# (their source is not shown in this answer).

def curveShorten(dist, timestep, maxit):
    """
    iterates the function image on a 2d grid through an euler anisotropic
    diffusion operator with timestep=timestep maxit number of times
    """
    image = 1*dist
    forme = image.shape
    if(np.size(forme) > 2):
        sys.exit('Only works on gray images')

    aSize = forme[0]*forme[1]
    xdim = np.int32(forme[0])
    ydim = np.int32(forme[1])

    image[0,:] = image[1,:]
    image[xdim-1,:] = image[xdim-2,:]
    image[:,ydim-1] = image[:,ydim-2]
    image[:,0] = image[:,1]

    # np arrays I need to store things on the CPU; image is the initial
    # condition and final is the final state
    image = image.reshape(aSize, order='C').astype(np.float32)
    final = np.zeros(aSize).astype(np.float32)

    # allocating memory on the GPU
    image_gpu = drv.mem_alloc(image.nbytes)
    final_gpu = drv.mem_alloc(final.nbytes)

    # sending data to each memory location
    drv.memcpy_htod(image_gpu, image)  # host to device copying
    drv.memcpy_htod(final_gpu, final)

    # block size: B := dim1*dim2*dim3 = 1024
    # grid size : dim1*dim2*dim3 = ceiling(aSize/B)
    blockX = int(1024)
    multiplier = aSize/float(1024)
    if(aSize/float(1024) > int(aSize/float(1024))):
        gridX = int(multiplier + 1)
    else:
        gridX = int(multiplier)

    strm1 = drv.Stream(1)
    ev1 = drv.Event()
    strm2 = drv.Stream()

    for k in range(0, maxit):
        Kern_diffIteration(image_gpu, final_gpu, ydim, xdim, np.float32(timestep),
                           block=(blockX,1,1), grid=(gridX,1,1), stream=strm1)
        strm1.synchronize()

        if(strm1.is_done() == 1):
            Kern_boundaryX0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))
            Kern_boundaryX1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))#,stream=strm1)
            Kern_boundaryY0(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))#,stream=strm2)
            Kern_boundaryY1(final_gpu, ydim, xdim, block=(blockX,1,1), grid=(gridX,1,1))#,stream=strm1)

        if(strm1.is_done() == 1):
            drv.memcpy_dtod(image_gpu, final_gpu, final.nbytes)
            #Kern_copy(image_gpu,final_gpu,ydim,xdim,block=(blockX,1,1), grid=(gridX,1,1),stream=strm1)

    drv.memcpy_dtoh(final, final_gpu)  # device to host copying
    #final_gpu.free()
    #image_gpu.free()
    return final.reshape(forme, order='C')
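One thing that stands out in the code above, offered as a guess rather than a diagnosis: `drv.Stream(1)` passes the flag value 1, which in the CUDA driver API is the non-blocking flag, so work in `strm1` is not implicitly ordered against the default stream, where the boundary kernels and `memcpy_dtod` run. The next diffusion step may therefore start before the copy has finished. Also, `is_done()` right after `synchronize()` is always true, so it adds no ordering. The sketch below queues every operation of a step on one stream and relies on in-order execution within that stream. The names `evolve_on_one_stream` and `grid_size` and the parameter list are hypothetical, and the kernels are passed in as arguments because their source is not shown.

```python
import numpy as np


def grid_size(n_elems, block):
    """Number of blocks needed to cover n_elems threads (ceiling division)."""
    return (n_elems + block - 1) // block


def evolve_on_one_stream(diff_kernel, boundary_kernels, image_gpu, final_gpu,
                         xdim, ydim, timestep, maxit, nbytes, n_elems,
                         block=1024):
    """Run maxit diffusion steps with all GPU work queued on a single stream."""
    # Local import: needs PyCUDA and an already created CUDA context.
    import pycuda.driver as drv

    stream = drv.Stream()
    launch = dict(block=(block, 1, 1),
                  grid=(grid_size(n_elems, block), 1, 1),
                  stream=stream)

    for _ in range(maxit):
        # Operations in one stream execute in the order they were enqueued,
        # so no host-side synchronization is needed inside the loop.
        diff_kernel(image_gpu, final_gpu, ydim, xdim,
                    np.float32(timestep), **launch)
        for kern in boundary_kernels:
            kern(final_gpu, ydim, xdim, **launch)
        drv.memcpy_dtod_async(image_gpu, final_gpu, nbytes, stream)

    stream.synchronize()   # wait once, after everything has been queued
    return final_gpu
```

With the names from the code above this would be called as `evolve_on_one_stream(Kern_diffIteration, [Kern_boundaryX0, Kern_boundaryX1, Kern_boundaryY0, Kern_boundaryY1], image_gpu, final_gpu, xdim, ydim, timestep, maxit, final.nbytes, aSize)`, followed by the existing `drv.memcpy_dtoh(final, final_gpu)`. If the diffusion and boundary kernels together write every element of the output, swapping the two buffer handles each iteration would avoid the device-to-device copy as well.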
