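<p>In my opinion, allowing <em>pure</em> object-mode numba functions (or not warning when numba realizes that the whole function uses Python objects) is unfortunate, because such functions are generally a bit slower than pure Python functions.</p>
<p>Numba is very powerful (type dispatch, and being able to write Python code without type declarations, is great compared with C extensions or Cython), but only when it supports the operations you use:</p>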
<ul>
<li><a href="http://numba.pydata.org/numba-doc/dev/reference/pysupported.html" rel="nofollow noreferrer">Supported Python features in numba</a></li>
<li><a href="http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html" rel="nofollow noreferrer">Supported NumPy features in numba</a></li>
</ul>
<p>This means that anything not listed there is not supported in "nopython" mode. And if numba has to fall back to <a href="http://numba.pydata.org/numba-doc/dev/glossary.html#term-object-mode" rel="nofollow noreferrer">"object mode"</a>, beware:</p>
<blockquote>
<p><strong>object mode</strong></p>
<p>A Numba compilation mode that generates code that handles all values as Python objects and uses the Python C API to perform all operations on those objects. Code compiled in object mode will often run no faster than Python interpreted code, unless the Numba compiler can take advantage of loop-jitting.</p>
</blockquote>
<p>That is exactly what happens in your case: the function runs entirely in object mode:</p>
<pre><code>>>> nbjaccard.inspect_types()
[...]
# - LINE 3 -
# seq1 = arg(0, name=seq1) :: pyobject
# seq2 = arg(1, name=seq2) :: pyobject
# $0.1 = global(set: <class 'set'>) :: pyobject
# $0.3 = call $0.1(seq1) :: pyobject
# $0.4 = global(set: <class 'set'>) :: pyobject
# $0.6 = call $0.4(seq2) :: pyobject
# set1 = $0.3 :: pyobject
# set2 = $0.6 :: pyobject
set1, set2 = set(seq1), set(seq2)
# - LINE 4 -
# $const0.7 = const(int, 1) :: pyobject
# $0.8 = global(len: <built-in function len>) :: pyobject
# $0.11 = set1 & set2 :: pyobject
# $0.12 = call $0.8($0.11) :: pyobject
# $0.13 = global(float: <class 'float'>) :: pyobject
# $0.14 = global(len: <built-in function len>) :: pyobject
# $0.17 = set1 | set2 :: pyobject
# $0.18 = call $0.14($0.17) :: pyobject
# $0.19 = call $0.13($0.18) :: pyobject
# $0.20 = $0.12 / $0.19 :: pyobject
# $0.21 = $const0.7 - $0.20 :: pyobject
# $0.22 = cast(value=$0.21) :: pyobject
# return $0.22
return 1 - len(set1 & set2) / float(len(set1 | set2))
</code></pre>
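<p>As you can see, every single operation works on Python objects (as indicated by the <code>:: pyobject</code> at the end of each line). That is because <code>numba</code> doesn't support <code>str</code>s and <code>set</code>s, so there is absolutely nothing here that could be made faster, unless you know how to solve the problem with numpy arrays or homogeneous lists (of numeric types).</p>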
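<p>To illustrate what "homogeneous lists of numeric types" could look like for this problem, here is a minimal sketch. The helper names <code>encode</code> and <code>jaccard_sorted</code> are hypothetical, not part of the original question. The strings are converted to sorted integer code points in plain Python, and the distance is computed with a simple loop over integers. It is not benchmarked and not decorated with numba here; whether a jitted version would actually be faster is an open assumption.</p>

```python
def encode(text):
    # Sorted, de-duplicated integer code points: a homogeneous list of ints.
    return sorted({ord(ch) for ch in text})


def jaccard_sorted(codes1, codes2):
    # Hypothetical helper: Jaccard distance for two sorted, duplicate-free
    # integer lists. Assumes that the two lists are not both empty.
    i = j = common = 0
    # Two-pointer merge: count the values present in both lists.
    while i < len(codes1) and j < len(codes2):
        if codes1[i] == codes2[j]:
            common += 1
            i += 1
            j += 1
        elif codes1[i] < codes2[j]:
            i += 1
        else:
            j += 1
    union = len(codes1) + len(codes2) - common
    return 1.0 - common / union


distance = jaccard_sorted(encode("compare this string"),
                          encode("compare a different string"))
print(distance)
```

<p>The loop only touches integers and list indices, which is the kind of code the "supported features" pages above are about; the set-based encoding step stays outside of it.</p>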
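<p>On my computer the timing difference is much larger (using numba 0.32.0), but the individual timings are much faster: <strong>microseconds</strong> (<code>10**-6</code> seconds) rather than <strong>milliseconds</strong> (<code>10**-3</code> seconds).</p>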
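<p>Note that <code>jit</code> is <a href="http://numba.pydata.org/numba-doc/latest/user/jit.html#lazy-compilation" rel="nofollow noreferrer">lazy</a> by default, so the function should be called once before you start timing it, because that first call includes the time needed to compile the code.</p>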
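<p>Outside of IPython, the "call once, then time" pattern could be sketched with the standard <code>timeit</code> module as below. The helper <code>best_time</code> is a hypothetical name, and the sketch times the plain Python function so that it runs without numba installed; a jitted function could be passed in the same way.</p>

```python
import timeit


def jaccard(seq1, seq2):
    set1, set2 = set(seq1), set(seq2)
    return 1 - len(set1 & set2) / float(len(set1 | set2))


def best_time(func, *args, number=10000, repeat=3):
    # Warm-up call: for a lazily jitted function this is where compilation
    # happens, so it is kept outside the measured region.
    func(*args)
    timings = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    # Best of `repeat` runs, converted to seconds per call.
    return min(timings) / number


seconds_per_call = best_time(jaccard, "compare this string",
                             "compare a different string")
print("{:.2f} microseconds per call".format(seconds_per_call * 1e6))
```

<p>The lambda adds a small constant overhead to every call, so the numbers are only comparable with other measurements taken the same way.</p>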
<hr/>
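<p>However, there is one optimization you can make: if you know the intersection of the two sets, you can calculate the length of the union from it (as @Paul Hankin mentioned in his <em>now deleted</em> answer):</p>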
<pre><code>len(union) = len(set1) + len(set2) - len(intersection)
</code></pre>
<p>This leads to the following (pure Python) code:</p>
<pre><code>def jaccard2(seq1, seq2):
    set1, set2 = set(seq1), set(seq2)
    num_intersection = len(set1 & set2)
    # len(union) = len(set1) + len(set2) - len(intersection)
    return 1 - num_intersection / float(len(set1) + len(set2) - num_intersection)

%timeit jaccard2("compare this string","compare a different string")
100000 loops, best of 3: 13.7 µs per loop
</code></pre>
<p>Not much faster, but better.</p>
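<p>Since the rewrite relies on <code>len(union) = len(set1) + len(set2) - len(intersection)</code>, a quick sanity check is to compare it against the direct union-based formula on random inputs. This is a small sketch; <code>jaccard</code> below is the set-based formula from the inspected source, and <code>random_text</code> is a hypothetical helper.</p>

```python
import random
import string


def jaccard(seq1, seq2):
    set1, set2 = set(seq1), set(seq2)
    return 1 - len(set1 & set2) / float(len(set1 | set2))


def jaccard2(seq1, seq2):
    set1, set2 = set(seq1), set(seq2)
    num_intersection = len(set1 & set2)
    return 1 - num_intersection / float(len(set1) + len(set2) - num_intersection)


def random_text(rng, max_len=20):
    # Hypothetical helper: random non-empty lowercase string.
    length = rng.randint(1, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


rng = random.Random(0)  # fixed seed so the check is reproducible
mismatches = 0
for _ in range(1000):
    a, b = random_text(rng), random_text(rng)
    if jaccard(a, b) != jaccard2(a, b):
        mismatches += 1
print(mismatches)  # 0: both versions divide the same two integers
```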
<hr/>
<p>If you use <a href="/questions/tagged/cython" class="post-tag" title="show questions tagged 'cython'" rel="tag">cython</a>, there is some room for further improvement:</p>
<pre><code>%load_ext cython

%%cython
def cyjaccard(seq1, seq2):
    cdef set set1 = set(seq1)
    cdef set set2 = set()

    cdef Py_ssize_t length_intersect = 0

    for char in seq2:
        if char not in set2:
            # First time this character of seq2 is seen.
            if char in set1:
                length_intersect += 1
            set2.add(char)

    return 1 - (length_intersect / float(len(set1) + len(set2) - length_intersect))

%timeit cyjaccard("compare this string","compare a different string")
100000 loops, best of 3: 7.97 µs per loop
</code></pre>
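<p>The main advantage here is that a single iteration both builds <code>set2</code> and counts the number of elements in the intersection (the intersection set never has to be created at all)!</p>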
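<p>The single-pass idea itself does not depend on Cython. For readers who want to check the logic without compiling anything, the same loop could be written in plain Python as in the sketch below. The function name <code>jaccard_single_pass</code> is hypothetical, and no claim is made about how its speed compares with the timings above.</p>

```python
def jaccard_single_pass(seq1, seq2):
    set1 = set(seq1)
    set2 = set()
    length_intersect = 0
    for char in seq2:
        if char not in set2:
            # Count each distinct character of seq2 once if it is also in seq1.
            if char in set1:
                length_intersect += 1
            set2.add(char)
    # len(union) = len(set1) + len(set2) - len(intersection)
    return 1 - length_intersect / float(len(set1) + len(set2) - length_intersect)


def jaccard(seq1, seq2):
    # Reference version that builds the intersection and union explicitly.
    set1, set2 = set(seq1), set(seq2)
    return 1 - len(set1 & set2) / float(len(set1 | set2))


s1, s2 = "compare this string", "compare a different string"
print(jaccard_single_pass(s1, s2) == jaccard(s1, s2))  # True
```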