<p>I mostly agree with the advice from @chepner and @juanpa.arrivillaga in the comments. In addition, numpy already handles the scalar-with-array operations in your code correctly and efficiently.</p>
<p>However, there is in fact a way to significantly improve the performance of your cython code, thanks to the way this particular algorithm is structured, if we use the following assumptions (and can tolerate ugly code):</p>
<ul>
<li>Your arrays are all one-dimensional, which makes iterating over every item in each array quite trivial. We don't need to replace harder numpy functions such as <code>numpy.dot</code>, since every operation in your code only combines scalars with arrays.</li>
<li>While using a <code>for</code> loop in python would be unthinkable, iterating over every index is quite feasible in cython. Additionally, each item in the final output depends only on the inputs at that item's index (i.e. the 0th item uses <code>u[0]</code>, <code>PorosityProfile[0]</code>, etc.).</li>
<li>You are not interested in any of the intermediate arrays, only in the final result returned from your <code>compute_python</code> function. So why waste time allocating memory for all of those intermediate numpy arrays?</li>
<li>Using the <code>x**y</code> syntax is surprisingly slow. I use the <code>gcc</code> compiler option <code>-ffast-math</code> to improve this significantly. I also use several cython compiler directives to avoid python checks and overhead.</li>
<li>Creating numpy arrays itself can carry python overhead, so I use a combination of typed memoryviews (the preferred, newer syntax anyway) and malloc'd pointers to create the output array without interacting with python very much (only two lines, getting the output size and the return statement, show significant python interaction, as can be seen in the cython annotation file).</li>
</ul>
<p>Taking all of these considerations into account, here is the modified code. It performs nearly an order of magnitude faster than the naive python version on my laptop.</p>
<p><strong>sublimation.pyx</strong></p>
<pre><code>from libc.stdlib cimport malloc, free

def compute_cython(float[:] u, float[:] porosity_profile,
        float[:] density_ice_profile, float[:] density_dust_profile,
        float[:] density_profile):
    cdef:
        float dust_j, dust_f, dust_g, dust_h, dust_i
        float ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h
        int size, i
        float dt, result_dust, x, dust
        float result_ice_numer, result_ice_denom, result_ice, ice
        float* out

    dust_j, dust_f, dust_g, dust_h, dust_i = \
        250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h = \
        273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    size = len(u)
    # allocate the output buffer directly, bypassing numpy entirely
    out = <float *>malloc(size * sizeof(float))

    for i in range(size):
        dt = u[i] - dust_j
        result_dust = dust_f + (dust_g*dt) + (dust_h*dt**2) + (dust_i*dt**3)
        x = u[i] / ice_i
        result_ice_numer = x**3*(ice_c + ice_d*x**2 + ice_e*x**6)
        result_ice_denom = 1 + ice_f*x**2 + ice_g*x**4 + ice_h*x**8
        result_ice = result_ice_numer / result_ice_denom
        ice = density_ice_profile[i]*result_ice
        dust = density_dust_profile[i]*result_dust
        out[i] = (dust + ice)/density_profile[i]

    # the malloc'd buffer is handed off wrapped in a memoryview; note that
    # it is never explicitly freed here
    return <float[:size]>out
</code></pre>
<p><strong>setup.py</strong></p>
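<p>A minimal <code>setup.py</code> along the following lines works here. The exact contents of the <code>args</code> and <code>link_args</code> lists are assumptions on my part, chosen to match the <code>-ffast-math</code> flag and the cython compiler directives mentioned above (add <code>-fopenmp</code> to both lists for the <code>prange</code> version further below):</p>
<pre><code># build with: python3 setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

# assumed compiler/linker flag lists (referenced again in Update 1)
args = ["-O3", "-ffast-math"]
link_args = []

exts = [Extension("sublimation", sources=["sublimation.pyx"],
                  extra_compile_args=args, extra_link_args=link_args)]

setup(ext_modules=cythonize(
    exts,
    compiler_directives={
        "boundscheck": False,  # skip bounds checks on indexing
        "wraparound": False,   # disable negative-index wrapping
        "cdivision": True,     # use C division (no ZeroDivisionError check)
    },
))
</code></pre>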
<p><strong>main.py</strong></p>
<pre><code>import numpy as np
import sublimation as sub

def compute_python(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6
    delta = u-DustJ
    result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
    x = u/IceI
    result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile

size = 100
u = np.random.rand(size).astype(np.float32)
porosity = np.random.rand(size).astype(np.float32)
ice = np.random.rand(size).astype(np.float32)
dust = np.random.rand(size).astype(np.float32)
density = np.random.rand(size).astype(np.float32)

"""
Run these from the terminal to test the performance!

python3 -m timeit -s "from main import compute_python, u, porosity, ice, dust, density" "compute_python(u, porosity, ice, dust, density)"
python3 -m timeit -s "from main import sub, u, porosity, ice, dust, density" "sub.compute_cython(u, porosity, ice, dust, density)"
python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython(u, porosity, ice, dust, density))"

The first command tests the python version. (10000 loops, best of 3: 45.5 usec per loop)
The second command tests the cython version, but returns just a memoryview object. (100000 loops, best of 3: 4.63 usec per loop)
The third command tests the cython version, but converts the result to an ndarray (slower). (100000 loops, best of 3: 6.3 usec per loop)
"""
</code></pre>
<p>Let me know if anything in my explanation of how this answer works is unclear, and I hope it helps!</p>
<hr/>
<p><strong>Update 1:</strong></p>
<p>Unfortunately, I was unable to get MSYS2 and numba (which depends on LLVM) to play nicely with each other, so I could not make any direct comparisons. However, following @max9111's suggestion, I added <code>-march=native</code> to the <code>args</code> list in my <code>setup.py</code> file; the timings, however, were not noticeably different from before.</p>
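<p>In the <code>setup.py</code> sketch above, that amounts to a one-line change:</p>
<pre><code># previous flags plus CPU-specific instruction selection
args = ["-O3", "-ffast-math", "-march=native"]
</code></pre>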
<p>From <a href="https://stackoverflow.com/a/49804276">this great answer</a>, I learned that some overhead occurs in the automatic conversion between numpy arrays and typed memoryviews, both in the initial function call (and in the return statement as well, if you convert the result back). Reverting to a function signature like this:</p>
<pre><code>import numpy as np
cimport numpy as np

ctypedef np.float32_t DTYPE_t

def compute_cython_np(
        np.ndarray[DTYPE_t, ndim=1] u,
        np.ndarray[DTYPE_t, ndim=1] porosity_profile,
        np.ndarray[DTYPE_t, ndim=1] density_ice_profile,
        np.ndarray[DTYPE_t, ndim=1] density_dust_profile,
        np.ndarray[DTYPE_t, ndim=1] density_profile):
</code></pre>
<p>saves about 1us per call, bringing it down to 3.6us instead of 4.6us, which is somewhat significant, especially if the function is to be called many times. <strong>Of course, if you plan to call the function many times, it may be more efficient to simply pass in two-dimensional numpy arrays instead, saving significant python function call overhead and amortizing the cost of the <code>numpy array -> typed memoryview</code> conversion.</strong> Additionally, it might be interesting to use numpy structured arrays, which can be transformed in cython into a typed memoryview of structs, as this would put all of the data for each index closer together in the cache and speed up memory access times (a sketch of this idea follows below).</p>
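<p>As a rough sketch of that last idea (the struct layout, field names, and function here are hypothetical, not part of the original code), the cython side could look like the following, with the python side passing in a structured array whose dtype has matching <code>np.float32</code> fields:</p>
<pre><code># hypothetical sketch: a packed struct matching a numpy structured dtype
# such as np.dtype([("u", np.float32), ("porosity", np.float32),
#                   ("density_ice", np.float32), ("density_dust", np.float32),
#                   ("density", np.float32)])
cdef packed struct Profile:
    float u
    float porosity
    float density_ice
    float density_dust
    float density

def sum_density(Profile[:] profiles):
    # trivial kernel just to demonstrate field access; the real loop body
    # from compute_cython would read profiles[i].u etc. instead
    cdef int i
    cdef float total = 0
    for i in range(profiles.shape[0]):
        total += profiles[i].density
    return total
</code></pre>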
<p>Finally, as promised in the earlier comments, here is a version using <code>prange</code> that takes advantage of parallel processing. Note that this can only be used with typed memoryviews, since python's GIL must be released inside a prange loop (and it must be compiled with the <code>-fopenmp</code> flag in both <code>args</code> and <code>link_args</code>):</p>
<pre><code>from cython.parallel import prange
from libc.stdlib cimport malloc, free

def compute_cython_p(float[:] u, float[:] porosity_profile,
        float[:] density_ice_profile, float[:] density_dust_profile,
        float[:] density_profile):
    cdef:
        float dust_j, dust_f, dust_g, dust_h, dust_i
        float ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h
        int size, i
        float dt, result_dust, x, dust
        float result_ice_numer, result_ice_denom, result_ice, ice
        float* out

    dust_j, dust_f, dust_g, dust_h, dust_i = \
        250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h = \
        273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    size = len(u)
    out = <float *>malloc(size * sizeof(float))

    for i in prange(size, nogil=True):
        dt = u[i] - dust_j
        result_dust = dust_f + (dust_g*dt) + (dust_h*dt**2) + (dust_i*dt**3)
        x = u[i] / ice_i
        result_ice_numer = x**3*(ice_c + ice_d*x**2 + ice_e*x**6)
        result_ice_denom = 1 + ice_f*x**2 + ice_g*x**4 + ice_h*x**8
        result_ice = result_ice_numer / result_ice_denom
        ice = density_ice_profile[i]*result_ice
        dust = density_dust_profile[i]*result_dust
        out[i] = (dust + ice)/density_profile[i]

    return <float[:size]>out
</code></pre>
<hr/>
<p><strong>Update 2:</strong></p>
<p>Following @max9111's very helpful additional suggestion in the comments, I switched all of the <code>float[:]</code> declarations in my code to <code>float[::1]</code>. The significance of this is that it lets cython assume the data is stored contiguously, so it does not need to account for a possible stride between elements. This enables SIMD vectorization, optimizing the code considerably further.</p>
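<p>For example, the serial function's signature becomes:</p>
<pre><code># float[::1] asserts C-contiguous input, removing per-element stride handling
def compute_cython(float[::1] u, float[::1] porosity_profile,
        float[::1] density_ice_profile, float[::1] density_dust_profile,
        float[::1] density_profile):
</code></pre>
<p>Here are the updated timings, generated using the following commands:</p>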
<pre><code>python3 -m timeit -s "from main import compute_python, u, porosity, ice, dust, density" "compute_python(u, porosity, ice, dust, density)"
python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython(u, porosity, ice, dust, density))"
python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython_p(u, porosity, ice, dust, density))"

size = 100
python: 44.7 usec per loop
cython serial: 4.44 usec per loop
cython parallel: 111 usec per loop
cython serial contiguous: 3.83 usec per loop
cython parallel contiguous: 116 usec per loop

size = 1000
python: 167 usec per loop
cython serial: 16.4 usec per loop
cython parallel: 115 usec per loop
cython serial contiguous: 8.24 usec per loop
cython parallel contiguous: 111 usec per loop

size = 10000
python: 1.32 msec per loop
cython serial: 128 usec per loop
cython parallel: 142 usec per loop
cython serial contiguous: 55.5 usec per loop
cython parallel contiguous: 150 usec per loop

size = 100000
python: 19.5 msec per loop
cython serial: 1.21 msec per loop
cython parallel: 691 usec per loop
cython serial contiguous: 473 usec per loop
cython parallel contiguous: 274 usec per loop

size = 1000000
python: 211 msec per loop
cython serial: 12.3 msec per loop
cython parallel: 5.74 msec per loop
cython serial contiguous: 4.82 msec per loop
cython parallel contiguous: 1.99 msec per loop
</code></pre>