<p>我有了一个如何使用桶来解决这个问题的想法。其思想是根据元素的值和公差级别形成一个键。为了确保将存储桶“边缘”中的潜在匹配项与“边缘”中的其他元素进行比较,将比较所有相邻存储桶。最后,我修改了@Tim Roberts执行实际匹配的方法,使之在两列上都匹配</p>
<p>我把它做成了一个叫做<a href="https://pypi.org/project/close-numerical-matches/" rel="nofollow noreferrer">close-numerical-matches</a>的图书馆。示例用法:</p>
<pre class="lang-py prettyprint-override"><code>>>> import numpy as np
>>> from close_numerical_matches import find_matches
>>> arr0 = np.array([[25, 24], [50, 50], [25, 26]])
>>> arr1 = np.array([[25, 23], [25, 25], [50.6, 50.6], [60, 60]])
>>> find_matches(arr0, arr1, tol=1.0001)
array([[0, 0], [0, 1], [1, 2], [2, 1]])
>>> find_matches(arr0, arr1, tol=0.9999)
array([[1, 2]])
>>> find_matches(arr0, arr1, tol=0.60001)
array([], dtype=int64)
>>> find_matches(arr0, arr1, tol=0.60001, dist='max')
array([[1, 2]])
>>> manhatten_dist = lambda arr: np.sum(np.abs(arr), axis=1)
>>> matches = find_matches(arr0, arr1, tol=0.11, dist=manhatten_dist)
>>> matches
array([[0, 1], [0, 1], [2, 1]])
>>> indices0, indices1 = matches.T
>>> arr0[indices0]
array([[25, 24], [25, 24], [25, 26]])
</code></pre>
<p>一些分析:</p>
<pre class="lang-py prettyprint-override"><code>from timeit import default_timer as timer
import numpy as np
from close_numerical_matches import naive_find_matches, find_matches
arr0 = np.random.rand(320_000, 2)
arr1 = np.random.rand(44_000, 2)
start = timer()
naive_find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start) # 255.335 s
start = timer()
find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start) # 5.821 s
</code></pre>