Implementing k-means clustering accelerated by the triangle inequality in Python (scikit-learn)

Question

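I am trying to run k-means clustering on a large dataset (9106 items, 100 dimensions). This makes it very slow, so it was suggested that I use the triangle inequality, as described in Charles Elkan's paper (http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf).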

Is there a ready-made function that implements this?

I have been using scikit-learn, and here is my code:

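import numpy as np
from time import time
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# allocate a numpy array to hold the data (9106 rows, 100 columns)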
data_array = np.empty([9106,100])

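# iterate through the data file and add it to the numpy array, skipping the
# header row and the first column (`reader` is presumably a csv.reader that
# was created earlier; that part is not shown)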
rownum = 0
for row in reader:
    if rownum != 0:
        print "rownum",rownum
        colnum = 0
        for col in row:
            if colnum != 0:
                data_array[rownum-1,colnum-1] = float(col)
            # advance the column counter for every column, including the first
            colnum += 1
    rownum += 1

n_samples, n_features = data_array.shape
n_digits = len(data_array)
labels = None #digits.target


#most of the code below was taken from the example on the scikit learn site
sample_size = 200

print "n_digits: %d, \t n_samples %d, \t n_features %d" % (n_digits,
                                                        n_samples, n_features)

print 79 * '_'
print ('% 9s' % 'init'
      '    time  inertia    homo   compl  v-meas     ARI     AMI  silhouette')


def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print '% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f' % (
         name, (time() - t0), estimator.inertia_,
         metrics.homogeneity_score(labels, estimator.labels_),
         metrics.completeness_score(labels, estimator.labels_),
         metrics.v_measure_score(labels, estimator.labels_),
         metrics.adjusted_rand_score(labels, estimator.labels_),
         metrics.adjusted_mutual_info_score(labels,  estimator.labels_),
         metrics.silhouette_score(data, estimator.labels_,
                                  metric='euclidean',
                                  sample_size=sample_size),
         )


bench_k_means(KMeans(init='k-means++', k=n_digits, n_init=10),
              name="k-means++", data=data_array)

bench_k_means(KMeans(init='random', k=n_digits, n_init=10),
              name="random", data=data_array)

# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data_array)
bench_k_means(KMeans(init=pca.components_, k=n_digits, n_init=1),
              name="PCA-based",
              data=data_array)
print 79 * '_'
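
The triangle-inequality idea from Elkan's paper mentioned in the question can be illustrated with a small sketch. This is not scikit-learn code and not Elkan's full algorithm (which also keeps upper and lower distance bounds across iterations); it only shows the core pruning test for a single assignment pass. The helper names `euclidean` and `assign_with_pruning` are made up for this illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, number_of_skipped_distance_computations).
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[euclidean(centers[i], centers[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_dist = euclidean(p, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), the triangle inequality
            # gives d(p, c_j) >= d(p, c_best), so c_j cannot be closer.
            if cc[best][j] >= 2.0 * best_dist:
                skipped += 1
                continue
            d = euclidean(p, centers[j])
            if d < best_dist:
                best, best_dist = j, d
        labels.append(best)
    return labels, skipped
```

Calling `assign_with_pruning(points, centers)` with lists of coordinate tuples returns the same labels as a brute-force nearest-center search, while counting how many point-to-center distances were never computed. Newer scikit-learn releases also accept `algorithm="elkan"` as an argument to `KMeans`; whether that option is available depends on the installed version.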

machine learning scikit-learn data analysis k-means algorithm optimization high-dimensional data clustering triangle inequality


2 Answers
