Looping over rows in CUDA + Numba

    from numba import cuda

    @cuda.jit
    def forward_energy(im, energy, m):
        row, col = cuda.grid(2)
        xstride, _ = cuda.gridsize(2)
        height, width = im.shape[0], im.shape[1]
        if row >= height or col >= width:
            return
        for i in range(row, im.shape[0], xstride):
            # example code below is dependent on the previous row of `im`
            energy[i, col] = min(im[i - 1, col], im[i - 1, col - 1])
            im[i, col] = ...  # update current row (elided in the original post)

1 Answer

User
#1 · Posted 2024-06-07 11:46:33

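For completeness, here is the approach I ended up with:
I removed the striding from the code above
and added a for loop that launches the new kernel once per row: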
    for row in range(1, image.shape[0]):
        forward_energy[(1,), (1, image.shape[1])](row, imd, ed, md)
        cuda.synchronize()
I also updated the function parameters so the kernel receives the row to work on:

    @cuda.jit
    def forward_energy(row, im, energy, m):
        # algo stuff
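This thread is a useful read on asynchronous kernel launches and how they behave. I am not entirely sure that cuda.synchronize() is necessary before running the next loop iteration, but I would rather err on the side of caution.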
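To make the row-by-row dependency concrete, here is a minimal pure-Python sketch of what the per-row launches compute. It is a CPU stand-in rather than the author's kernel: the names `forward_energy_row` and `forward_energy_host` are hypothetical, the update `im[row][col] += energy[row][col]` is an assumed placeholder for the step the post elides, and the left edge is clamped (an assumption) instead of letting `col - 1` wrap to the last column. Something like this can serve as a reference when checking GPU output.

```python
def forward_energy_row(row, im, energy):
    """CPU stand-in for one per-row kernel launch; each `col` plays one GPU thread."""
    width = len(im[row])
    for col in range(width):
        left = max(col - 1, 0)  # assumption: clamp at the left edge instead of wrapping
        energy[row][col] = min(im[row - 1][col], im[row - 1][left])
    for col in range(width):
        # Hypothetical update rule; the original post does not show this step.
        im[row][col] += energy[row][col]


def forward_energy_host(im):
    """Mirror of the host loop: one 'launch' per row, starting at row 1."""
    energy = [[0] * len(r) for r in im]
    for row in range(1, len(im)):
        forward_energy_row(row, im, energy)
    return energy


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
energy = forward_energy_host(image)
```

Because row 2 reads the values that row 1 has just written into `im`, the rows cannot be processed in an arbitrary order, which is why the host loop launches them one at a time.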
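Update (23 December 2020):
Although the solution above works, it is painfully slow. Thankfully, Numba is about to ship cooperative groups in 0.53 (according to their GitHub milestones). If you are reading this after 0.53 has been released, go straight to the code below; if not, you can use the latest Numba with this setup guide. That gives you access to Numba's cooperative groups, cuda.cg, and with them grid-wide synchronization.
This solution essentially removes the need to launch the kernel multiple times. Previously the for loop was hoisted out of the kernel; now the for loop lives inside the kernel: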
    import math

    threads_per_block = 128
    blocks_per_grid = math.ceil(image.shape[1] / threads_per_block)
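The launch configuration is plain ceiling division: one thread per column, rounded up to a whole number of blocks. The sketch below uses a hypothetical helper, `launch_config`, to show the arithmetic and how many surplus threads the rounding produces. It uses integer arithmetic instead of `math.ceil`, which gives the same result for positive sizes.

```python
def launch_config(width, threads_per_block=128):
    """Return (blocks_per_grid, threads_per_block) covering `width` columns."""
    # Integer ceiling division: equivalent to math.ceil(width / threads_per_block).
    blocks_per_grid = (width + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


def surplus_threads(width, threads_per_block=128):
    """Number of launched threads that have no column to work on."""
    blocks, tpb = launch_config(width, threads_per_block)
    return blocks * tpb - width


config = launch_config(1000)   # 1000 columns need 8 blocks of 128 threads
extra = surplus_threads(1000)  # 8 * 128 - 1000 threads fall past the last column
```

The surplus threads matter for a grid-wide barrier: as far as I understand, every thread in the grid is expected to reach the barrier, so it seems safer to guard only the per-column work with `if col < width:` and still let out-of-range threads take part in the synchronization, rather than returning early.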
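Loop over the rows inside the kernel:

    @cuda.jit
    def forward_energy(im, *args):
        col = cuda.grid(1)
        g = cuda.cg.this_grid()
        for row in range(1, im.shape[0]):
            # do stuff
            g.sync()  # grid-wide barrier: all threads finish this row before the next one starts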
Launch the kernel once:

    forward_energy[blocks_per_grid, threads_per_block](im, *args)
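That's it. It is much faster than the previous approach of managing successive rows from the host. As far as I understand, g.sync() requires every thread to call g.sync() before any of them can move on to the next loop iteration. This forces the rows to be processed in order. Good stuff.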
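The barrier behavior described above can be imitated on the CPU with Python's `threading.Barrier`. This is only an analogy, not Numba or CUDA code: each Python thread plays the role of one GPU thread (one column), `barrier.wait()` stands in for `g.sync()`, and the update rule and left-edge clamping are assumptions made for illustration. The function name `run_rows_with_barrier` is hypothetical.

```python
import threading


def run_rows_with_barrier(im):
    """Process rows in order; one Python thread per column, a barrier after each row."""
    height, width = len(im), len(im[0])
    barrier = threading.Barrier(width)  # stand-in for the grid-wide g.sync()

    def worker(col):
        for row in range(1, height):
            left = max(col - 1, 0)  # assumption: clamp at the left edge
            # Hypothetical update that depends on the previous row.
            im[row][col] += min(im[row - 1][col], im[row - 1][left])
            # No thread starts row + 1 until every thread has finished this row.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(width)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return im


result = run_rows_with_barrier([[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]])
```

Each thread writes only its own column of the current row and reads only the previous row, so the barrier is the single point that keeps neighbouring columns from reading a value that has not been written yet.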
