使用numpy在矩阵上计算距离

1 投票
2 回答
3126 浏览
提问于 2025-04-17 10:03

我正在尝试用Python实现K均值算法(我知道有现成的库可以用,但我想自己学会怎么实现它)。这里是我遇到问题的函数:

def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m , n where m is number of examples,
    and n is number of dimensions.

    centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
    k < m should hold.

    Returns:
    numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
    """

    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)

这个函数的目的是,给定m个点在n维空间中,以及k个质心在n维空间中,生成一个numpy数组(x1 x2 x3 x4 ... xm),其中x1是离第一个点最近的质心的索引。这个功能之前运行得很好,但当我尝试用4维的例子时,就出现了这个错误:

  File "/path/to/the/kmeans.py", line 28, in AssignPoints
    distances = np.hypot(*temp)
ValueError: invalid number of arguments

我该怎么解决这个问题,或者如果解决不了,你有什么建议可以帮助我计算我想要的结果吗?

我的回答

def AssignPoints(points, centroids):
    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    for i in xrange(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)

2 个回答

0

你需要定义一个距离函数,这里用到了 hypot。通常在 K-means 算法中,距离的计算方式是:距离 = 点与中心点之间的差的平方和。这里有一些 Matlab 代码可以实现这个功能……如果你不会的话,我可以帮你移植到其他语言,但你也可以试试看。就像你说的,只有通过实践才能学会。

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);


% Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;


end

更新

这个函数应该能返回距离,注意上面的 Matlab 代码只是返回了最近的中心点的距离(和索引)……而你的函数会返回所有的距离,下面的代码也是如此。

def FindDistance(X,centroids):
K=shape(centroids)[0]
examples, dimensions = shape(X)
distance = zeros((examples,K))
for ex in xrange(examples):
    distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
return distance
5

试试这个:

np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)

或者:

diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)

别问这在干嘛 :D

编辑:开玩笑的。中间的广播操作(points[np.newaxis] - centroids[:,np.newaxis])是把原来的数组变成两个三维数组。结果是每个“平面”里都包含了所有点和某个中心点之间的差值。我们可以把这个叫做 diffs

接着,我们用常规的方法来计算欧几里得距离(就是差值的平方和再开平方):np.sqrt((diffs ** 2).sum(axis=2))。最后我们得到一个 (k, m) 的矩阵,其中第0行包含了到 centroids[0] 的距离,依此类推。所以,.argmin(axis=0) 就能给你想要的结果。

撰写回答