欧氏距离的高效精确计算

3条回答

网友

1楼 · 编辑于 2024-05-23 14:56:25

作为一般的经验法则，尽可能遵循scipy和numpy实现，因为它们是矢量化的，并且比本机Python代码快得多。（主要原因是：在C中实现，矢量化消除了循环的类型检查开销。）

（旁白：我的答案不包括精确性，但我认为同样的原则也适用于精确性和效率。）

作为一点额外的收获，我将提供一些关于如何配置代码以衡量效率的信息。如果您使用的是IPython解释器，那么秘诀就是使用%prun行魔术。

In [1]: import numpy

In [2]: from scipy.spatial import distance

In [3]: c1 = numpy.array((52, 106, 35, 12))

In [4]: c2 = numpy.array((33, 153, 75, 10))

In [5]: %prun distance.euclidean(c1, c2)
         35 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        6    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        4    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 distance.py:232(euclidean)
        2    0.000    0.000    0.000    0.000 distance.py:152(_validate_vector)
        2    0.000    0.000    0.000    0.000 shape_base.py:9(atleast_1d)
        1    0.000    0.000    0.000    0.000 misc.py:11(norm)
        1    0.000    0.000    0.000    0.000 function_base.py:605(asarray_chkfinite)
        2    0.000    0.000    0.000    0.000 numeric.py:476(asanyarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'squeeze' of 'numpy.ndarray' objects}


In [6]: %prun numpy.linalg.norm(c1 - c2)
         10 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%prun所做的是告诉您一个函数调用需要多长时间才能运行，包括一些跟踪来找出瓶颈可能在哪里。在这种情况下，scipy.spatial.distance.euclidean和numpy.linalg.norm实现都非常快。假设您定义了一个函数dist(vect1, vect2)，那么您可以使用相同的IPython magic调用进行分析。另一个额外的好处是，%prun也可以在Jupyter笔记本中工作，您可以通过简单地将%%prun设为该单元格的第一行来对整个代码单元格（而不仅仅是一个函数）进行%%prun配置。

网友

2楼 · 编辑于 2024-05-23 14:56:25

这并不能完全回答这个问题，但可能值得一提的是，如果您对实际的欧几里德距离不感兴趣，而只想比较欧几里德距离，那么平方根就是单调函数，即x**（1/2）<；y**（1/2）当且仅当x<；y

所以，如果你不想要显式的距离，但例如只想知道向量1的欧几里德距离是否更接近一个向量列表，称为向量列表，你可以避免昂贵的（在精度和时间方面）平方根，但可以做一些类似的事情

min(vectorlist, key = lambda compare: sum([(a - b)**2 for a, b in zip(vector1, compare)])

网友

3楼 · 编辑于 2024-05-23 14:56:25

结论一：

从使用timeit进行效率测试的结果中，我们可以得出关于效率的结论：

Method5 (zip, math.sqrt)>；Method1 (numpy.linalg.norm)>；Method2 (scipy.spatial.distance)>；Method3 (sklearn.metrics.pairwise.euclidean_distances )

虽然我并没有真正测试您的Method4，因为它不适合一般情况，并且通常相当于Method5。

对其他人来说，Method5是最快的。而对于使用Method1的numpy，正如我们所期望的，在C语言中经过了大量优化，是第二快的。

对于scipy.spatial.distance，如果直接转到函数定义，您将看到它实际上正在使用numpy.linalg.norm，除了它将在实际的numpy.linalg.norm之前对两个输入向量执行验证。这就是它比tnumpy.linalg.norm稍慢的原因。

最后对于sklearn，根据文档：

This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed. However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required

因为在您的问题中，您希望使用一组固定的数据，所以这个实现的优势没有得到体现。由于性能和精度之间的折衷，它也给出了所有方法中最差的精度。

关于精度，Method5=Metho1=Method2>；Method3

效率测试脚本：

import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances
import math

# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1-v2)

# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)

# 3
def eudis3(v1, v2):
    return euclidean_distances(v1, v2)

# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist

dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)

import timeit

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)
t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)

print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))

效率测试输出：

t1:  0.654838958307
t2:  1.53977598714
t3:  6.7898791732
t5:  0.422228400305

精度测试脚本和结果：

In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238

In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238

In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])

In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238

结论一：

效率测试脚本：

效率测试输出：

精度测试脚本和结果：

相关问题更多 >

编程相关推荐

热门问题

热门文章