CuPy concurrency

Posted 2024-05-23 13:27:30


I am using CuPy (7.0.0) and trying to get streams to run concurrently with a simple example script:

import cupy as cp

# creating streams
map_streams = []
for i in range(0, 100):
    map_streams.append(cp.cuda.stream.Stream(non_blocking=True))

asize = (1000, 100)

# creating arrays on the device
x = cp.ones(asize)
y = cp.ones(asize)
z = cp.ndarray(asize)

# do multiplications in the streams
for stream in map_streams:
    with stream:
        z = x * y

However, the multiplications are executed sequentially, as the profiler output shows:

==8339== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
[...]
432.83ms  18.688us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        42  cupy_multiply__float64_float64_float64 [376]
433.01ms  19.391us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        43  cupy_multiply__float64_float64_float64 [381]
433.32ms  18.720us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        44  cupy_multiply__float64_float64_float64 [386]
433.52ms  19.936us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        45  cupy_multiply__float64_float64_float64 [391]
433.71ms  18.880us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        46  cupy_multiply__float64_float64_float64 [396]
433.89ms  19.680us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        47  cupy_multiply__float64_float64_float64 [401]
434.16ms  19.232us            (782 1 1)       (128 1 1)        14        0B        0B    Tesla K80 (0)         1        48  cupy_multiply__float64_float64_float64 [406]
[...]
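
One way to check the "sequential" reading of this trace is to compare each kernel's end time (Start + Duration) with the next kernel's start time. The sketch below is plain Python and only uses the seven rows printed above; the helper names `find_overlaps` and `idle_gaps_us` are made up for illustration, and the Start column is only printed to 10 us resolution, so the gaps are approximate.

```python
# Rows copied from the nvprof trace above: (start in ms, duration in us, stream id).
rows = [
    (432.83, 18.688, 42),
    (433.01, 19.391, 43),
    (433.32, 18.720, 44),
    (433.52, 19.936, 45),
    (433.71, 18.880, 46),
    (433.89, 19.680, 47),
    (434.16, 19.232, 48),
]


def find_overlaps(rows):
    """Return (stream_a, stream_b) pairs where kernel b starts before kernel a ends."""
    overlaps = []
    for (s0, d0, id0), (s1, _, id1) in zip(rows, rows[1:]):
        end0 = s0 + d0 / 1000.0  # duration is in us, start is in ms
        if s1 < end0:
            overlaps.append((id0, id1))
    return overlaps


def idle_gaps_us(rows):
    """Idle time in us between the end of one kernel and the start of the next."""
    return [
        (s1 - (s0 + d0 / 1000.0)) * 1000.0
        for (s0, d0, _), (s1, _, _) in zip(rows, rows[1:])
    ]


overlaps = find_overlaps(rows)
gaps = idle_gaps_us(rows)
longest_kernel_us = max(d for _, d, _ in rows)

print("overlapping pairs:", overlaps)
print("idle gaps (us):", [round(g) for g in gaps])
print("longest kernel (us):", longest_kernel_us)
```

On these rows no pair overlaps, and every idle gap is several times longer than the roughly 19 us each kernel runs. That pattern would be consistent with each kernel finishing well before the next one is even launched, in which case the time between launches on the host, rather than the streams themselves, may be what prevents overlap. This is a reading of the trace, not a confirmed diagnosis.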

Can anyone tell me what is wrong with the script?
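
For context on how large each launch is, the grid size in the profile can be reproduced from the array shape. The sketch below assumes the elementwise multiply uses one thread per element, which matches the (782 1 1) x (128 1 1) launch shown in the profile. The hardware figures (13 multiprocessors and 2048 resident threads per multiprocessor for one GPU of a Tesla K80) are assumptions taken from the published specifications for that card, not from this post.

```python
import math

asize = (1000, 100)        # array shape from the first script
threads_per_block = 128    # block size reported in the profile

n_elements = asize[0] * asize[1]
n_blocks = math.ceil(n_elements / threads_per_block)  # assumes one thread per element
launched_threads = n_blocks * threads_per_block

# ASSUMPTION: figures for one GPU of a Tesla K80, not stated in the post.
sm_count = 13
max_threads_per_sm = 2048
resident_threads = sm_count * max_threads_per_sm

# How many times over a single launch fills the device.
waves = launched_threads / resident_threads

print("blocks per launch:", n_blocks)
print("threads per launch:", launched_threads)
print("resident thread capacity:", resident_threads)
print("waves per launch:", round(waves, 2))
```

Under these assumptions a single multiply already launches more threads than the device can hold at once, so even if two launches were issued back to back there would be little spare capacity for a second kernel to occupy.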

Update:

Even when I increase the workload, the streams are still processed sequentially.

asize = (1000, 200)

x = cp.random.rand(asize[0], asize[1])
y = cp.random.rand(asize[0], asize[1])
z = cp.ndarray(asize)


for stream in map_streams:
    with stream:
        z = cp.fft.fft2(x * y)

The result is as follows:

[...]
1.8e+10s  10.784us            (391 1 1)       (128 1 1)        12        0B        0B    Tesla K80 (0)         1       100  cupy_copy__float64_complex128 [5444]
1.8e+10s  20.384us             (50 1 1)         (8 5 5)        72  7.8125KB        0B    Tesla K80 (0)         1       100  void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5491]
1.8e+10s  10.464us             (49 1 1)       (128 1 1)        72        0B        0B    Tesla K80 (0)         1       100  void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5494]
1.8e+10s  29.055us             (63 1 1)       (10 16 1)        92        0B  6.2500KB    Tesla K80 (0)         1       100  void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5496]
1.8e+10s  10.176us            (391 1 1)       (128 1 1)        12        0B        0B    Tesla K80 (0)         1       101  cupy_copy__float64_complex128 [5502]
1.8e+10s  20.896us             (50 1 1)         (8 5 5)        72  7.8125KB        0B    Tesla K80 (0)         1       101  void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5549]
1.8e+10s  10.592us             (49 1 1)       (128 1 1)        72        0B        0B    Tesla K80 (0)         1       101  void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5552]
1.8e+10s  28.831us             (63 1 1)       (10 16 1)        92        0B  6.2500KB    Tesla K80 (0)         1       101  void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5554]
1.8e+10s  10.431us            (391 1 1)       (128 1 1)        12        0B        0B    Tesla K80 (0)         1       102  cupy_copy__float64_complex128 [5560]
1.8e+10s  20.959us             (50 1 1)         (8 5 5)        72  7.8125KB        0B    Tesla K80 (0)         1       102  void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5607]
1.8e+10s  10.720us             (49 1 1)       (128 1 1)        72        0B        0B    Tesla K80 (0)         1       102  void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5610]
1.8e+10s  28.640us             (63 1 1)       (10 16 1)        92        0B  6.2500KB    Tesla K80 (0)         1       102  void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5612]
[...]
