PyOpenCL benchmark question

3 votes
1 answer
2801 views
Asked 2025-04-17 02:01

I made a small change to the standard example code from this link: https://github.com/inducer/pyopencl/blob/master/examples/benchmark-all.py

I replaced the hard-coded numbers with a variable, zz.

import pyopencl as cl
import numpy
import numpy.linalg as la
import datetime
from time import time
zz=100
a = numpy.random.rand(zz).astype(numpy.float32)
b = numpy.random.rand(zz).astype(numpy.float32)
c_result = numpy.empty_like(a)

# Speed in normal CPU usage
time1 = time()
for i in range(zz):
        for j in range(zz):
                c_result[i] = a[i] + b[i]
                c_result[i] = c_result[i] * (a[i] + b[i])
                c_result[i] = c_result[i] * (a[i] / 2)
time2 = time()
print("Execution time of test without OpenCL: ", time2 - time1, "s")


for platform in cl.get_platforms():
    for device in platform.get_devices():
        print("===============================================================")
        print("Platform name:", platform.name)
        print("Platform profile:", platform.profile)
        print("Platform vendor:", platform.vendor)
        print("Platform version:", platform.version)
        print("---------------------------------------------------------------")
        print("Device name:", device.name)
        print("Device type:", cl.device_type.to_string(device.type))
        print("Device memory: ", device.global_mem_size//1024//1024, 'MB')
        print("Device max clock speed:", device.max_clock_frequency, 'MHz')
        print("Device compute units:", device.max_compute_units)

        # Simple speed test
        ctx = cl.Context([device])
        queue = cl.CommandQueue(ctx, 
                properties=cl.command_queue_properties.PROFILING_ENABLE)

        mf = cl.mem_flags
        a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
        dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

        prg = cl.Program(ctx, """
            __kernel void sum(__global const float *a,
            __global const float *b, __global float *c)
            {
                        int loop;
                        int gid = get_global_id(0);
                        for(loop=0; loop<%s;loop++)
                        {
                                c[gid] = a[gid] + b[gid];
                                c[gid] = c[gid] * (a[gid] + b[gid]);
                                c[gid] = c[gid] * (a[gid] / 2);
                        }
                }
                """ % (zz)).build()

        exec_evt = prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)
        exec_evt.wait()
        elapsed = 1e-9*(exec_evt.profile.end - exec_evt.profile.start)

        print("Execution time of test: %g s" % elapsed)

        c = numpy.empty_like(a)
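        # enqueue_read_buffer was removed from newer PyOpenCL releases;
        # enqueue_copy(queue, dest, src) is the replacement.
        cl.enqueue_copy(queue, c, dest_buf).wait()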
        error = 0
        for i in range(zz):
                if c[i] != c_result[i]:
                        error = 1
        if error:
                print("Results don't match!!")
        else:
                print("Results OK")

With zz=100 I get:

('Execution time of test without OpenCL: ', 0.10500001907348633, 's')
===============================================================
('Platform name:', 'AMD Accelerated Parallel Processing')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Advanced Micro Devices, Inc.')
('Platform version:', 'OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)')
---------------------------------------------------------------
('Device name:', 'Cypress\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
('Device type:', 'GPU')
('Device memory: ', 800, 'MB')
('Device max clock speed:', 850, 'MHz')
('Device compute units:', 20)
Execution time of test: 0.00168922 s
Results OK
===============================================================
('Platform name:', 'AMD Accelerated Parallel Processing')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Advanced Micro Devices, Inc.')
('Platform version:', 'OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)')
---------------------------------------------------------------
('Device name:', 'Intel(R) Core(TM) i5 CPU         750  @ 2.67GHz\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
('Device type:', 'CPU')
('Device memory: ', 8183L, 'MB')
('Device max clock speed:', 3000, 'MHz')
('Device compute units:', 4)
Execution time of test: 4.369e-05 s
Results OK

So we have three timings:

normal  ('Execution time of test without OpenCL: ', 0.10500001907348633, 's')
pyopencl radeon 5870: Execution time of test: 0.00168922 s
pyopencl i5 CPU 750: Execution time of test: 4.369e-05 s

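First batch of questions: what exactly is "pyopencl on the i5 CPU 750"? Why is it 250 times faster than "normal" (the execution time of the test without OpenCL)? And why is it about 38 times faster than "pyopencl radeon 5870"?

With zz=1000 we get: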

normal  ('Execution time of test without OpenCL: ', 9.05299997329712, 's')
pyopencl radeon 5870:Execution time of test: 0.0104431 s
pyopencl i5 CPU 750: Execution time of test: 0.00238112 s

i5 time × 5 ≈ radeon 5870 time

i5 time × 3800 ≈ normal time

With zz=10000:

normal: takes too long, so I commented that code out
radeon 5870, Execution time of test: 0.085571 s
i5, Execution time of test: 0.261854 s

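Here we can see the graphics card start to win on performance.

The way the timings grow from one run to the next is also interesting. The normal time at stage 1 (zz=100) multiplied by 90 gives the normal time at stage 2 (zz=1000), and stage 2 multiplied by about 95 gives stage 3 (zz=10000, estimated from experience).

The i5 time at stage 1 multiplied by 52 gives the i5 time at stage 2, and stage 2 multiplied by 109 gives stage 3.

The radeon 5870 time at stage 1 multiplied by 6 gives the radeon time at stage 2, and stage 2 multiplied by 8 gives stage 3.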

Can anyone explain why the OpenCL timings do not grow linearly?
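The ratios quoted above can be rechecked directly from the reported timings. The sketch below only copies the numbers from the output; the names `timings`, `ratio`, and `growth_factors` are illustrative helpers, not part of the benchmark. Recomputed this way, the zz=100 "normal" time divided by the i5 time comes out at roughly 2,400 rather than 250, while the radeon/i5 ratio is indeed close to 38.

```python
# Timings in seconds, copied from the output reported above.
timings = {
    "normal": {100: 0.10500001907348633, 1000: 9.05299997329712},
    "radeon 5870": {100: 0.00168922, 1000: 0.0104431, 10000: 0.085571},
    "i5 CPU 750": {100: 4.369e-05, 1000: 0.00238112, 10000: 0.261854},
}


def ratio(slow, fast):
    """How many times longer `slow` took than `fast`."""
    return slow / fast


def growth_factors(series):
    """Ratios between consecutive timings, ordered by zz."""
    sizes = sorted(series)
    return [series[b] / series[a] for a, b in zip(sizes, sizes[1:])]


for name, series in timings.items():
    print(name, "growth per 10x zz:", [round(f, 1) for f in growth_factors(series)])

print("normal / i5 at zz=100:",
      round(ratio(timings["normal"][100], timings["i5 CPU 750"][100])))
print("radeon / i5 at zz=100:",
      round(ratio(timings["radeon 5870"][100], timings["i5 CPU 750"][100]), 1))
```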

1 Answer

2

Well, the growth is unlikely to be linear, because the complexity of the algorithm is O(zz^2).
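To make the O(zz^2) point concrete: the kernel is launched with zz work-items and each one loops zz times, so multiplying zz by 10 multiplies the work by 100. A rough way to see how close the measurements come to that is to fit a slope to log(time) against log(zz). This is only a sketch on the three timings from the question; `scaling_exponent` is a hypothetical helper, and three points are too few to draw firm conclusions from.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Timings in seconds, copied from the question.
sizes = [100, 1000, 10000]
i5_times = [4.369e-05, 0.00238112, 0.261854]
radeon_times = [0.00168922, 0.0104431, 0.085571]

print("i5 exponent: %.2f" % scaling_exponent(sizes, i5_times))
print("radeon exponent: %.2f" % scaling_exponent(sizes, radeon_times))
```

On these numbers the i5 slope is close to 2, which fits quadratic work, while the radeon slope stays below 1. That would be consistent with the GPU being far from saturated at these sizes, though the data alone cannot confirm the cause.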

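To judge whether something is "linear" you need more than three data points, and error bars are also useful for this kind of analysis. For a GPU, 100 threads are nowhere near enough to use its compute capacity; as your experiment shows, the GPU only starts to overtake the CPU at 10,000 threads or more, which is perfectly normal.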
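A minimal sketch of collecting more data points with a spread, using only the standard library. It times a plain Python stand-in with the same zz * zz loop shape rather than the OpenCL kernel; `time_repeated` and `cpu_loop` are hypothetical names, and the sizes and repeat counts are arbitrary choices.

```python
import statistics
import time


def time_repeated(func, repeats=10):
    """Run func() `repeats` times; return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def cpu_loop(zz):
    """Stand-in workload with the same zz * zz loop shape as the Python test."""
    total = 0.0
    for i in range(zz):
        for j in range(zz):
            total += (i + j) * 0.5
    return total


for zz in (50, 100, 200, 400):
    mean, spread = time_repeated(lambda: cpu_loop(zz), repeats=5)
    print("zz=%d: %.6f s +/- %.6f s" % (zz, mean, spread))
```

The same wrapper could be pointed at a function that enqueues the kernel and waits on the event, so that each zz gets a mean and a standard deviation instead of a single reading.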

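A 250x speedup on the CPU alone is not implausible either. Python is an interpreted language, so it is not fast to begin with, while OpenCL makes aggressive use of the CPU's SIMD instructions, which can give a respectable speedup even compared with C+OpenMP.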
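The interpreter overhead can be seen without OpenCL at all by comparing the element-by-element loop with a NumPy expression that does the same arithmetic on whole arrays. This is a sketch under the assumption that NumPy is installed; `python_loop` and `vectorized` are illustrative names, and the actual timings will depend on the machine.

```python
from time import perf_counter

import numpy

zz = 100
a = numpy.random.rand(zz).astype(numpy.float32)
b = numpy.random.rand(zz).astype(numpy.float32)


def python_loop(a, b, repeats):
    """Element-by-element version, as in the non-OpenCL test."""
    c = numpy.empty_like(a)
    for i in range(len(a)):
        for _ in range(repeats):
            c[i] = a[i] + b[i]
            c[i] = c[i] * (a[i] + b[i])
            c[i] = c[i] * (a[i] / 2)
    return c


def vectorized(a, b, repeats):
    """Same arithmetic, applied to whole arrays in compiled NumPy code."""
    c = numpy.empty_like(a)
    for _ in range(repeats):
        c = (a + b) * (a + b) * (a / 2)
    return c


start = perf_counter()
loop_result = python_loop(a, b, zz)
loop_time = perf_counter() - start

start = perf_counter()
vec_result = vectorized(a, b, zz)
vec_time = perf_counter() - start

print("Python loop: %g s" % loop_time)
print("NumPy vectorized: %g s" % vec_time)
print("Results match:", numpy.allclose(loop_result, vec_result))
```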
