CuPy: not seeing kernel concurrency

    import cupy

    kernel = cupy.RawKernel(
    '''
    extern "C"
    __global__ void simple_copy(float* iArr, float* oArr, int rows, int cols, int slice){
        unsigned int col = blockDim.x*blockIdx.x + threadIdx.x;
        unsigned int row = blockDim.y*blockIdx.y + threadIdx.y;

        if(row < rows && col < cols){
            //this for loop is just additional work to see kernel launches in visual profiler more easily
            for(int i=0; i<1000; i++){
                oArr[rows*cols*slice + row*cols + col] = iArr[rows*cols*slice + row*cols + col];
            }
        }
    }
    '''
    , 'simple_copy')

    device = cupy.cuda.Device()

    # [x, y, z]
    iArr1 = cupy.ones((32*32, 32*32, 32), dtype=cupy.float32)
    oArr1 = cupy.zeros((32*32, 32*32, 32), dtype=cupy.float32)

    n = 32
    map_streams = []
    for i in range(n):
        map_streams.append(cupy.cuda.stream.Stream(non_blocking=True))

    # I want to run kernel on individual z-axis slice asynchronous
    for i, stream in enumerate(map_streams):
        with stream:
            kernel((32, 32), (32, 32), (iArr1, oArr1, 32*32, 32*32, i))

    device.synchronize()

1 Answer

User

#1 · Posted 2024-05-23 15:47:04

"It seems like each RawKernel call is waiting for the previous kernel to finish, even though I specify that the stream is non-blocking. ... Does anyone have an idea of what I'm doing wrong?"

You're not doing anything wrong, apart from expecting something that can't happen.

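Concurrent kernel execution is only possible when enough resources are free to run more than one kernel at once. All currently supported GPUs allow at most 2048 active threads per multiprocessor, and your blocks have 1024 threads each, so at most two blocks can run per multiprocessor. Depending on the size of your GPU, that means fewer than about 60 blocks can run simultaneously at the absolute maximum. Each of your kernel launches uses a 32 × 32 grid, i.e. 1024 blocks, so a single launch already "saturates" the GPU's capacity many times over, and the chance that a second kernel would actually find resources to run is extremely small. That is why you see no overlap or concurrency between kernel launches.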
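The arithmetic behind this answer can be sketched in a few lines of plain Python. This is only a rough illustration: the helper name `max_resident_blocks` is made up for this example, and the count of 30 multiprocessors is an assumed value chosen to match the "about 60 blocks" figure, not a property of any particular GPU. It also considers the thread limit alone; register and shared-memory usage could lower the real number further.

```python
def max_resident_blocks(num_sms, threads_per_block, max_threads_per_sm=2048):
    """Upper bound on blocks resident on the whole GPU, from the thread limit alone.

    Hypothetical helper for illustration; not part of CuPy or CUDA.
    """
    blocks_per_sm = max_threads_per_sm // threads_per_block
    return num_sms * blocks_per_sm


# Launch configuration taken from the question: grid (32, 32), block (32, 32).
threads_per_block = 32 * 32   # 1024 threads in each block
blocks_per_launch = 32 * 32   # 1024 blocks in each kernel launch

# ASSUMPTION: 30 multiprocessors, picked only to reproduce "about 60 blocks".
capacity = max_resident_blocks(num_sms=30, threads_per_block=threads_per_block)

# How many times over a single launch fills the GPU.
oversubscription = blocks_per_launch / capacity

print(capacity)                    # 60
print(round(oversubscription, 1))  # 17.1
```

Under these assumptions a single launch asks for roughly 17 times more blocks than the GPU can hold at once, so blocks from a kernel in another stream have almost nowhere to run until the first kernel is nearly finished.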
