使用numpy在矩阵上计算距离
我正在尝试用Python实现K均值算法(我知道有现成的库可以用,但我想自己学会怎么实现它)。这里是我遇到问题的函数:
def AssignPoints(points, centroids):
"""
Takes two arguments:
points is a numpy array such that points.shape = m , n where m is number of examples,
and n is number of dimensions.
centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
k < m should hold.
Returns:
numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
"""
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
distances = np.hypot(*temp)
return distances.argmin(axis=1)
这个函数的目的是,给定m个点在n维空间中,以及k个质心在n维空间中,生成一个numpy数组(x1 x2 x3 x4 ... xm),其中x1是离第一个点最近的质心的索引。这个功能之前运行得很好,但当我尝试用4维的例子时,就出现了这个错误:
File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
我该怎么解决这个问题,或者如果解决不了,你有什么建议可以帮助我计算我想要的结果吗?
我的回答
def AssignPoints(points, centroids):
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
for i in xrange(len(temp)):
temp[i] = temp[i] ** 2
distances = np.add.reduce(temp) ** 0.5
return distances.argmin(axis=1)
2 个回答
0
你需要定义一个距离函数,这里用到了 hypot。通常在 K-means 算法中,距离的计算方式是:距离 = 点与中心点之间的差的平方和。这里有一些 Matlab 代码可以实现这个功能……如果你不会的话,我可以帮你移植到其他语言,但你也可以试试看。就像你说的,只有通过实践才能学会。
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
% idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
% in idx for a dataset X where each row is a single example. idx = m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
[value index] = min(Distance);
idx(loop) = index;
end;
end
更新
这个函数应该能返回距离,注意上面的 Matlab 代码只是返回了最近的中心点的距离(和索引)……而你的函数会返回所有的距离,下面的代码也是如此。
def FindDistance(X,centroids):
K=shape(centroids)[0]
examples, dimensions = shape(X)
distance = zeros((examples,K))
for ex in xrange(examples):
distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
return distance
5
试试这个:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
或者:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
别问这在干嘛 :D
编辑:开玩笑的。中间的广播操作(points[np.newaxis] - centroids[:,np.newaxis]
)是把原来的数组变成两个三维数组。结果是每个“平面”里都包含了所有点和某个中心点之间的差值。我们可以把这个叫做 diffs
。
接着,我们用常规的方法来计算欧几里得距离(就是差值的平方和再开平方):np.sqrt((diffs ** 2).sum(axis=2))
。最后我们得到一个 (k, m)
的矩阵,其中第0行包含了到 centroids[0]
的距离,依此类推。所以,.argmin(axis=0)
就能给你想要的结果。