CUDA Python error: TypingError: cannot determine Numba type of <class 'object'>

Posted 2024-04-29 17:11:51

Background: I'm trying to create a simple bootstrap function that samples with replacement. I want to parallelize the function because I will eventually deploy it on data with millions of data points, and I'll want the sample sizes to be much larger. I have run other examples, such as the Mandelbrot example. In the code below you will see that I have a CPU version of the function, which runs fine.

I have read through several references to get this up and running:

Random Numbers with CUDA

Writing Kernels in CUDA

The problem: this is my first attempt at CUDA programming, and I believe I have everything set up correctly. I'm getting an error that I can't seem to make sense of:

TypingError: cannot determine Numba type of <class 'object'>

I believe the relevant line of code is:

bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu)
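
One quick check that can help narrow down which argument fails typing is to print the runtime type of each launch argument (a sketch reusing the same variable names; note it won't catch names referenced only inside the kernel body, which can also fail typing):

# Print the runtime type of each kernel launch argument; anything that shows
# up as a plain Python object is a likely suspect for the TypingError.
for name, arg in [("rng_states", rng_states),
                  ("dt_arry_device", dt_arry_device),
                  ("n_samp", n_samp),
                  ("out_mean_gpu", out_mean_gpu)]:
    print(name, type(arg))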

Attempts to resolve it: I won't go into all the details, but here is what I've tried:

  • I thought it might have something to do with cuda.to_device(). I changed it around and also called cuda.to_device_array_like(). I used to_device() for all of the arguments, and for only some of them. I've seen code examples where it's used for every argument and sometimes not, so I'm not sure what to do (see the sketch after this list)

  • I removed the GPU random number generator (create_xoroshiro128p_states) and just used a static value for testing

  • Explicitly assigned the integers with int() (and without). I'm not sure why I did this; I read that Numba only supports a limited set of data types, so I made sure they were ints:

Numba Supported Datatypes

  • A few other things that I can't remember
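
For reference, the usual explicit transfer pattern looks like this minimal sketch (a trivial doubling kernel purely for illustration; none of these names come from the code below):

import numpy as np
from numba import cuda

# Trivial kernel: double each element in place.
@cuda.jit
def double_kernel(arr):
    i = cuda.grid(1)
    if i < arr.shape[0]:
        arr[i] *= 2.0

host = np.arange(8, dtype=np.float32)
dev = cuda.to_device(host)      # explicit host -> device copy
double_kernel[1, 32](dev)       # launch config is [blocks_per_grid, threads_per_block]
result = dev.copy_to_host()     # device -> host; keep the return value

# Passing 'host' directly also works (Numba copies it to the device and back
# implicitly), but explicit transfers make the copies and their cost visible.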

Apologies for the messy code. I'm a bit in over my head with this.

Below is the full code:

import numpy as np
from numpy import random
from numpy.random import randn
import pandas as pd
from timeit import default_timer as timer

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1,size=(50)) #get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()
     
@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(thread_id, dt_arry.shape[0], stride):
        for k in range(0,n_samp-1,1):
            rand_idx_arry[k] = int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)         
        out_mean[thread_id] = dt_arry[rand_idx_arry].mean()



mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################

start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)

##################
# RUN ON GPU
##################

threads_per_block = 64
blocks_per_grid = 24

#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1) 

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu_device.copy_to_host()
dt = timer() - start

print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)

1 Answer

You appear to have at least 4 issues:

  1. In your kernel code, rand_idx_arry is undefined (see the aside after this list)
  2. You can't do .mean() in CUDA device code
  3. Your kernel launch configuration parameters are reversed
  4. The range of your kernel's grid-stride loop is incorrect. dt_arry.shape[0] is 50, so you were only populating the first 50 locations in the GPU output array. Just like the host code, the range of this grid-stride loop should be the size of the output array (i.e., boot_samp)
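
As an aside on items 1 and 2 (a hedged sketch, not part of the fix below): if a per-thread scratch array were really needed, Numba provides cuda.local.array, whose shape must be a compile-time constant. Even with such an array, though, device code supports neither NumPy fancy indexing like dt_arry[rand_idx_arry] nor .mean(), which is why the refactored kernel instead accumulates a running sum:

from numba import cuda, int32

@cuda.jit
def kernel_with_scratch(out):
    # Per-thread local array; the shape (50) must be a compile-time constant.
    scratch = cuda.local.array(50, int32)
    for k in range(50):
        scratch[k] = k
    i = cuda.grid(1)
    if i < out.shape[0]:
        # Element-by-element indexing works; fancy indexing does not.
        out[i] = scratch[i % 50]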

There may be other issues as well, but when I refactor the code like this to address those items, it seems to run correctly:

$ cat t65.py
#import matplotlib.pyplot as plt
import numpy as np
from numpy import random
from numpy.random import randn
from timeit import default_timer as timer

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1,size=(50)) #get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()

@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(thread_id, out_mean.shape[0], stride):
        my_sum = 0.0
        for k in range(0,n_samp-1,1):
            my_sum += dt_arry[int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)]
        out_mean[thread_id] = my_sum/(n_samp-1)



mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean

#plt.plot(dt_arry)

#figureData = plt.figure(1)
#plt.title('Plot ' + str(n_samp) + ' samples')
#plt.plot(dt_arry)
#figureData.show()

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################

start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)


#figureMeanCpu = plt.figure(2)
#plt.title('Plot '+ str(boot_samp) + ' bootstrap means - CPU')
#plt.plot(out_mean_cpu)
#figureData.show()


##################
# RUN ON GPU
##################

threads_per_block = 64
blocks_per_grid = 24

#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu = out_mean_gpu_device.copy_to_host()
dt = timer() - start

print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
$ python t65.py
CPU Bootstrap mean of 1000 mean samples: 10.148048544038735
Bootstrap CPU in 0.037496 s
GPU Bootstrap mean of 1000 mean samples: 10.145088765532936
Bootstrap GPU in 0.416822 s
$

Notes:

  • I commented out a bunch of stuff that seemed irrelevant. When you post code in the future, you may want to do something similar (remove things that are irrelevant to your question)
  • I also fixed a few issues at the end with your final GPU printout
  • I haven't studied your code carefully. I'm not suggesting any of it is defect-free; I'm just pointing out some issues and giving a roadmap for how to address them. I can see that the results don't match between CPU and GPU, but since I don't know what you're trying to do, and since the random number generators don't match between the CPU and GPU code, it's not obvious to me whether things should match
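
Note also that the measured GPU interval above includes the kernel's one-time JIT compilation (it happens on the first launch) as well as the host-to-device copies. A fairer kernel timing warms up once and synchronizes before stopping the clock; a minimal sketch, reusing the names from the refactored code:

# Warm-up launch: triggers JIT compilation so it isn't counted in the timing.
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
cuda.synchronize()

start = timer()
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
cuda.synchronize()   # kernel launches are asynchronous; wait before reading the clock
dt = timer() - start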
