Python中的L1距离k均值算法

Question

给定一个大小为NxM的特征向量，存储在numpy矩阵中。有没有什么方法可以使用K均值算法，通过L1距离（曼哈顿距离）对它进行聚类？

Answer 1

在这里有一段代码，链接在这个地方，它使用了scipy.spatial.distance库中大约20种不同的距离计算方法。你也可以看看这个链接，能不能分享一下你在使用L1和L2距离时的结果呢？

Answer 2

这里有一个使用L1距离（曼哈顿距离）的K均值算法。为了通用性，特征向量被表示为一个列表，这样很容易转换成numpy矩阵。

    import random
    #Manhattan Distance
    def L1(v1,v2):
      if(len(v1)!=len(v2):
        print “error”
        return -1
      return sum([abs(v1[i]-v2[i]) for i in range(len(v1))])

    # kmeans with L1 distance. 
    # rows refers to the NxM feature vectors
    def kcluster(rows,distance=L1,k=4):# Cited from Programming Collective Intelligence 
        # Determine the minimum and maximum values for each point
        ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))]

        # Create k randomly placed centroids
        clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)]

        lastmatches=None
        for t in range(100):
            print 'Iteration %d' % t
            bestmatches=[[] for i in range(k)]
            # Find which centroid is the closest for each row
            for j in range(len(rows)):
                row=rows[j]
                bestmatch=0
                for i in range(k):
                    d=distance(clusters[i],row)
                    if d<distance(clusters[bestmatch],row): 
                        bestmatch=i
                bestmatches[bestmatch].append(j)
            ## If the results are the same as last time, this is complete
            if bestmatches==lastmatches:
                break
            lastmatches=bestmatches

            # Move the centroids to the average of their members
            for i in range(k):
                avgs=[0.0]*len(rows[0])
                if len(bestmatches[i])>0:
                    for rowid in bestmatches[i]:
                        for m in range(len(rows[rowid])):
                            avgs[m]+=rows[rowid][m]
                    for j in range(len(avgs)):
                        avgs[j]/=len(bestmatches[i])
                    clusters[i]=avgs
        return bestmatches

Answer 3

我觉得在scipy里没有明确提供这个功能，不过你可以看看下面这个链接：

http://projects.scipy.org/scipy/ticket/612

Python中的L1距离k均值算法

4 个回答

撰写回答