Cython inline function with a numpy array as parameter


Consider code like this:

<pre><code>import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t] arr, int i):
    arr[i]+= 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(arr, i)

def test2(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        arr[i] += 1
</code></pre>

I used ipython to measure the speed of test1 and test2:

<pre><code>In [7]: timeit ttt.test1(arr)
100 loops, best of 3: 6.13 ms per loop

In [8]: timeit ttt.test2(arr)
100000 loops, best of 3: 9.79 us per loop
</code></pre>

Is there any way to optimize test1? Why doesn't Cython inline this function as it was told to?

UPDATE: What I actually need is multidimensional code like this:

<pre><code># cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1
</code></pre>

Timings:

<pre><code>In [7]: timeit ttt.test1(arr)
1 loops, best of 3: 647 ms per loop

In [8]: timeit ttt.test2(arr)
100 loops, best of 3: 2.07 ms per loop
</code></pre>

Inlining by hand gives a 300x speedup. My real function is quite large, though, so inlining it manually makes the code much harder to maintain.

UPDATE 2:

<pre><code># cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j):
    arr[i, j]+= 1

def test1(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i,j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i,j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1

cdef class FastPassingFloat2DArray(object):
    cdef float* data
    cdef int stride0, stride1

    def __init__(self, np.ndarray[np.float32_t, ndim=2] arr):
        self.data = <float*>arr.data
        self.stride0 = arr.strides[0]/arr.dtype.itemsize
        self.stride1 = arr.strides[1]/arr.dtype.itemsize

    def __getitem__(self, tuple tp):
        cdef int i, j
        cdef float *pr, r
        i, j = tp
        pr = (self.data + self.stride0*i + self.stride1*j)
        r = pr[0]
        return r

    def __setitem__(self, tuple tp, float value):
        cdef int i, j
        cdef float *pr, r
        i, j = tp
        pr = (self.data + self.stride0*i + self.stride1*j)
        pr[0] = value

cdef inline inc2(FastPassingFloat2DArray arr, int i, int j):
    arr[i, j]+= 1

def test3(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i,j
    cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr)
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc2(tmparr, i,j)
</code></pre>

Timings:

<pre><code>In [4]: timeit ttt.test1(arr)
1 loops, best of 3: 623 ms per loop

In [5]: timeit ttt.test2(arr)
100 loops, best of 3: 2.29 ms per loop

In [6]: timeit ttt.test3(arr)
1 loops, best of 3: 201 ms per loop
</code></pre>
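The pointer arithmetic in `FastPassingFloat2DArray` relies on converting NumPy's byte strides into element strides. Below is a minimal pure-NumPy sketch of that indexing rule, not part of the original question. The helper names `element_strides` and `flat_offset` are hypothetical, and the sketch uses `//` because under Python 3 semantics `/` would produce a float.

```python
import numpy as np

def element_strides(arr):
    """Return the strides of `arr` in units of elements instead of bytes."""
    return tuple(s // arr.itemsize for s in arr.strides)

def flat_offset(arr, i, j):
    """Offset of element (i, j) from the start of the underlying buffer."""
    stride0, stride1 = element_strides(arr)
    return stride0 * i + stride1 * j

# A C-contiguous 3x4 float32 array; ravel() exposes its buffer in memory order.
arr = np.arange(12, dtype=np.float32).reshape(3, 4)
buf = arr.ravel()

# The same rule also covers a transposed view, which shares the buffer
# but swaps the two strides.
transposed = arr.T

same_direct = buf[flat_offset(arr, 2, 1)] == arr[2, 1]
same_transposed = buf[flat_offset(transposed, 1, 2)] == transposed[1, 2]
print(element_strides(arr), element_strides(transposed))
print(same_direct, same_transposed)
```

This is the offset that `self.data + self.stride0*i + self.stride1*j` computes in the class above, assuming the array's data pointer is the start of the buffer.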


1 Answer

Anonymous, 1 day ago

More than 3 years have passed since the question was posted, and there has been great progress in the meantime. With this code (from Update 2 of the question, here with <code>np.int32_t</code>):

<pre><code># cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j]+= 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1
</code></pre>

I get the following timings:

<pre><code>arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)

1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop
</code></pre>

So the problem is still reproducible more than 3 years later. Cython now has <a href="http://docs.cython.org/src/userguide/memoryviews.html" rel="noreferrer">typed memoryviews</a>. They were introduced in Cython 0.16, so they were not available when the question was posted. With them:

<pre><code># cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(int[:, ::1] tmv, int i, int j):
    tmv[i, j]+= 1

def test3(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            inc(tmv, i, j)

def test4(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            tmv[i,j] += 1
</code></pre>

I get:

<pre><code>arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)

1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop
</code></pre>

We are almost there, and already faster than the old-school way! Now, the <code>inc()</code> function is eligible to be declared <a href="http://docs.cython.org/src/userguide/external_C_code.html#declaring-a-function-as-callable-without-the-gil" rel="noreferrer"><code>nogil</code></a>, so let's declare it so! But oops:

<pre><code>Error compiling Cython file:
[...]

cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
^
[...]
Function with Python return type cannot be declared nogil
</code></pre>

Ah, I completely missed that the <code>void</code> return type was missing! Once again, this time with <code>void</code>:

<pre><code>cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
    tmv[i, j]+= 1
</code></pre>

And finally I get:

<pre><code>%timeit test3(arr)
%timeit test4(arr)

1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop
</code></pre>

As fast as manual inlining!

<hr/>

Just for fun, I tried <a href="http://numba.pydata.org/" rel="noreferrer">Numba</a> on this code:

<pre><code>import numpy as np
from numba import autojit, jit

@autojit
def inc(arr, i, j):
    arr[i, j] += 1

@autojit
def test5(arr):
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)
</code></pre>

I get:

<pre><code>arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)

100 loops, best of 3: 4.03 ms per loop
</code></pre>

Even though it is 4.7x slower than Cython, most likely because the JIT compiler failed to inline <code>inc()</code>, I think it is awesome! All I needed to do was add <code>@autojit</code>, without cluttering the code with clumsy type declarations: an 88x speedup for next to nothing!

I also tried other things with Numba, such as

<pre><code>@jit('void(i4[:],i4,i4)')
def inc(arr, i, j):
    arr[i, j] += 1
</code></pre>

or <code>nopython=True</code>, but failed to improve it any further.

<a href="https://github.com/numba/numba/issues/160" rel="noreferrer">Improving inlining is on the Numba developers' list</a>; we just need to file more requests to give it a higher priority. ;)
