Answers to the question: Cython function takes more time than pure Python

Cython function takes more time than pure Python


I am trying to speed up my code, and this part of it is giving me problems. I tried using Cython and followed the advice given <a href="http://cython.readthedocs.io/en/latest/src/tutorial/numpy.html" rel="nofollow noreferrer">here</a>, but my pure Python function performs better than both the plain Cython function and the optimized Cython function. The Cython code is as follows:

<pre><code>import numpy as np
cimport numpy as np

# np.float was removed in NumPy 1.24; np.float64 is the same type
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def compute_cython(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    delta = u-DustJ
    result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

    x = u/IceI
    result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

    return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile

def compute_cythonOptimized(np.ndarray[DTYPE_t, ndim=1] u,
                            np.ndarray[DTYPE_t, ndim=1] PorosityProfile,
                            np.ndarray[DTYPE_t, ndim=1] DensityIceProfile,
                            np.ndarray[DTYPE_t, ndim=1] DensityDustProfile,
                            np.ndarray DensityProfile):
    assert u.dtype == DTYPE
    assert PorosityProfile.dtype == DTYPE
    assert DensityIceProfile.dtype == DTYPE
    assert DensityDustProfile.dtype == DTYPE
    assert DensityProfile.dtype == DTYPE

    cdef float DustJ = 250.0
    cdef float DustF = 633.0
    cdef float DustG = 2.513
    cdef float DustH = -2.2e-3
    cdef float DustI = -2.8e-6
    cdef float IceI = 273.16
    cdef float IceC = 1.843e5
    cdef float IceD = 1.6357e8
    cdef float IceE = 3.5519e9
    cdef float IceF = 1.6670e2
    cdef float IceG = 6.4650e4
    cdef float IceH = 1.6935e6

    cdef np.ndarray[DTYPE_t, ndim=1] delta = u-DustJ
    cdef np.ndarray[DTYPE_t, ndim=1] result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

    cdef np.ndarray[DTYPE_t, ndim=1] x = u/IceI
    cdef np.ndarray[DTYPE_t, ndim=1] result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

    return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile
</code></pre>

Then I ran the following commands: [commands missing from the source]

The results are:

For pure Python: <code>68.9 µs ± 851 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)</code>

For non-optimized Cython: <code>68.2 µs ± 685 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)</code>

For the optimized one: <code>72.7 µs ± 416 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)</code>

What am I doing wrong? Thanks for your help.


1 answer

Anonymous, 1 day ago

<h2>Solution using Numba</h2>

CodeSurgeon has already given a good answer using Cython. In this answer I want to show an alternative approach using Numba.

I created three versions. In <code>naive_Numba</code> I only added a function decorator. In <code>improved_Numba</code> I manually fused the loops (every vectorized command is actually a loop). In <code>improved_Numba_p</code> I parallelized the function. Note that there is apparently a bug that prevents defining constant values inside the function when the parallel accelerator is used, so that version takes the constants as arguments. Also note that the parallelized version only pays off for larger input arrays. You could, however, add a small wrapper that calls the single-threaded or the parallelized version depending on the size of the input array.

Code, dtype=float64

<pre><code>import numba as nb
import numpy as np
import time

@nb.njit(fastmath=True)
def naive_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    delta = u-DustJ
    result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

    x = u/IceI
    result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

    return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile

# error_model='numpy' sets division by 0 to NaN instead of throwing an exception; this allows vectorization
@nb.njit(fastmath=True,error_model='numpy')
def improved_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6
    res=np.empty(u.shape[0],dtype=u.dtype)

    # one fused loop instead of one loop per vectorized expression
    for i in range(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]

    return res

# there is obviously a bug in Numba (declaring const values in the function)
@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile,
                     DustJ, DustF, DustG, DustH, DustI, IceI, IceC, IceD, IceE, IceF, IceG, IceH):
    res=np.empty((u.shape[0]),dtype=u.dtype)

    for i in nb.prange(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]

    return res

u=np.array(np.random.rand(1000000),dtype=np.float32)
PorosityProfile=np.array(np.random.rand(1000000),dtype=np.float32)
DensityIceProfile=np.array(np.random.rand(1000000),dtype=np.float32)
DensityDustProfile=np.array(np.random.rand(1000000),dtype=np.float32)
DensityProfile=np.array(np.random.rand(1000000),dtype=np.float32)
DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

# don't measure compilation overhead on first call
res=improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile,DustJ, DustF, DustG, DustH, DustI,IceI, IceC, IceD, IceE, IceF, IceG, IceH)

# start the timer after the warm-up call
t1=time.time()
for i in range(1000):
    res=improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile,DustJ, DustF, DustG, DustH, DustI,IceI, IceC, IceD, IceE, IceF, IceG, IceH)

print(time.time()-t1)
</code></pre>

Performance

[timings missing from the source]

Code, dtype=np.float32

If np.float32 is sufficient, you have to declare all constant values in the function explicitly as float32. Otherwise Numba will use float64.

<pre><code>@nb.njit(fastmath=True,error_model='numpy')
def improved_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = nb.float32(250.0), nb.float32(633.0), nb.float32(2.513), nb.float32(-2.2e-3), nb.float32(-2.8e-6)
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = nb.float32(273.16), nb.float32(1.843e5), nb.float32(1.6357e8), nb.float32(3.5519e9), nb.float32(1.6670e2), nb.float32(6.4650e4), nb.float32(1.6935e6)
    res=np.empty(u.shape[0],dtype=u.dtype)

    for i in range(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(nb.float32(1)+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]

    return res

@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    res=np.empty((u.shape[0]),dtype=u.dtype)
    DustJ, DustF, DustG, DustH, DustI = nb.float32(250.0), nb.float32(633.0), nb.float32(2.513), nb.float32(-2.2e-3), nb.float32(-2.8e-6)
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = nb.float32(273.16), nb.float32(1.843e5), nb.float32(1.6357e8), nb.float32(3.5519e9), nb.float32(1.6670e2), nb.float32(6.4650e4), nb.float32(1.6935e6)

    for i in nb.prange(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)

        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(nb.float32(1)+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))

        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]

    return res
</code></pre>

Performance

<pre><code>Arraysize np.random.rand(100).astype(np.float32)
Numpy             29.3µs
improved Numba:   1.33µs
improved_Numba_p: 18µs

Arraysize np.random.rand(1000000).astype(np.float32)
Numpy             117ms
improved Numba:   2.46ms
improved_Numba_p: 1.56ms
</code></pre>

The comparison with the Cython version provided by @CodeSurgeon is not quite fair, because he did not compile the function with AVX2 and FMA3 instructions enabled. Numba compiles with -march=native by default, which enables AVX2 and FMA3 instructions on my Core i7-4xxx.

Leaving them disabled does make sense, though, if you want to distribute a compiled Cython version of your code, because with these optimizations enabled it will not run on pre-Haswell processors (or on any Pentium or Celeron). Compiling multiple code paths should be possible, but it is compiler-dependent and more work.
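The size-based wrapper mentioned earlier in this answer is not spelled out, so here is a minimal sketch of how it might look. The helper name <code>make_size_dispatcher</code> and the threshold value are assumptions made for illustration only. The float32 timings above only show that the crossover lies somewhere between 100 and 1,000,000 elements on that machine, so the actual value has to be measured.

```python
from typing import Callable, Sequence

# Hypothetical crossover point (an assumption, not a measured value):
# benchmark both versions on your own machine to pick it.
PARALLEL_THRESHOLD = 100_000


def make_size_dispatcher(serial_fn: Callable, parallel_fn: Callable,
                         threshold: int = PARALLEL_THRESHOLD) -> Callable:
    """Return a function that forwards to serial_fn or parallel_fn by input size."""

    def dispatch(u: Sequence, *args):
        # Small inputs: the cost of starting threads outweighs the work itself,
        # so stay single-threaded below the threshold.
        if len(u) < threshold:
            return serial_fn(u, *args)
        return parallel_fn(u, *args)

    return dispatch
```

With the float32 versions, where both functions share the same signature, this could be used as <code>compute = make_size_dispatcher(improved_Numba, improved_Numba_p)</code> and then called like either of them. The float64 <code>improved_Numba_p</code> takes the constants as extra arguments, so it would need a thin adapter first.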
