Plotting multi-dimensional clusters on a 2D plot in Python
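I have been clustering a lot of data, and the results fall into two different classes of clusters.
The first class is 6-dimensional and the second is 12-dimensional. For now I have decided to use kmeans, since it seems to be the most intuitive clustering algorithm to start with.
I would like to know how to show these clusters on a 2D plot so that I can judge whether kmeans is working. I would like to use matplotlib, but any other Python package is fine too.
Class 1 is made up of these data types: (int, float, float, int, float, int).
Class 2 is made up of 12 floats.
I would like to get output similar to this
Any tips would be helpful.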
2 Answers
1
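After searching the web and finding a lot of odd, uncommented solutions, I finally figured out how to do this. If you are trying to do something similar, here is some code. It is pieced together from several sources, and much of it I wrote or modified myself. I hope it is easier to follow than what you find elsewhere.
The function is built around kmeans2 from scipy, which returns a list of centroids and a list of labels. kmeansdata is the numpy array passed to kmeans2 for clustering, and num_clusters is the number of clusters passed to kmeans2.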
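To make those inputs concrete, here is a minimal sketch of how kmeansdata, the centroid list, and the label list might be produced with `scipy.cluster.vq.kmeans2`. The random 12-dimensional data and the choice of 5 clusters are assumptions for illustration only; they are not the data from the question.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

# Hypothetical stand-in for the real data: 300 rows of 12 floats
rng = np.random.default_rng(0)
kmeansdata = rng.normal(size=(300, 12))
num_clusters = 5  # assumed value, pick whatever suits your data

# whiten() rescales every column to unit variance so that no single
# feature dominates the Euclidean distances used by k-means
scaled = whiten(kmeansdata)

# kmeans2 returns (centroids, labels):
#   centroid_list has shape (num_clusters, n_features)
#   label_list[i] is the index of the centroid closest to row i
centroid_list, label_list = kmeans2(scaled, num_clusters, minit="points")

print(centroid_list.shape, label_list.shape)
```

If you whiten the data like this, pass the same rescaled array to the plotting function so that the points and the centroids live in the same coordinate system.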
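The code below writes a new png file each time, so it never overwrites an existing one. It also plots only 50 clusters (if you have thousands of clusters, do not try to plot them all).
(The code was written for python 2.7; I guess it should work on other versions too.)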
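import numpy
import colorsys
import random
import os
# Note: matplotlib.mlab.PCA was removed in Matplotlib 3.1,
# so this import needs an older Matplotlib release
from matplotlib.mlab import PCA as mlabPCA
from matplotlib import pyplot as plt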


def get_colors(num_colors):
    """
    Function to generate a list of randomly generated colors
    The function first generates 256 different colors and then
    we randomly select the number of colors required from it
    num_colors -> Number of colors to generate
    colors -> Consists of 256 different colors
    random_colors -> Randomly returns required(num_colors) colors
    """
    colors = []
    random_colors = []
    # Generate 256 different colors and choose num_colors randomly
    for i in numpy.arange(0., 360., 360. / 256.):
        hue = i / 360.
        lightness = (50 + numpy.random.rand() * 10) / 100.
        saturation = (90 + numpy.random.rand() * 10) / 100.
        colors.append(colorsys.hls_to_rgb(hue, lightness, saturation))

    # Colors are drawn with replacement, so repeats are possible
    for i in range(0, num_colors):
        random_colors.append(colors[random.randint(0, len(colors) - 1)])
    return random_colors


def random_centroid_selector(total_clusters, clusters_plotted):
    """
    Function to generate a list of randomly selected
    centroids to plot on the output png
    total_clusters -> Total number of clusters
    clusters_plotted -> Number of clusters to plot
    random_list -> Contains the index of clusters
                   to be plotted
    """
    random_list = []
    for i in range(0, clusters_plotted):
        random_list.append(random.randint(0, total_clusters - 1))
    return random_list


def plot_cluster(kmeansdata, centroid_list, label_list, num_cluster):
    """
    Function to convert the n-dimensional cluster to
    2-dimensional cluster and plotting 50 random clusters
    name%d.png -> file where the output is stored indexed
                  by first available file index
                  e.g. name0.png , name1.png ...
    """
    mlab_pca = mlabPCA(kmeansdata)
    # Keep only components that explain at least as much variance
    # as the second one, i.e. the first two principal components
    cutoff = mlab_pca.fracs[1]
    users_2d = mlab_pca.project(kmeansdata, minfrac=cutoff)
    centroids_2d = mlab_pca.project(centroid_list, minfrac=cutoff)

    colors = get_colors(num_cluster)
    plt.figure()
    plt.xlim([users_2d[:, 0].min() - 3, users_2d[:, 0].max() + 3])
    plt.ylim([users_2d[:, 1].min() - 3, users_2d[:, 1].max() + 3])

    # Plotting 50 clusters only for now
    random_list = random_centroid_selector(num_cluster, 50)

    # Plotting only the centroids which were randomly selected
    # Centroids are represented as a large 'o' marker
    for i, position in enumerate(centroids_2d):
        if i in random_list:
            plt.scatter(centroids_2d[i, 0], centroids_2d[i, 1], marker='o', c=[colors[i]], s=100)

    # Plotting only the points whose centers were plotted
    # Points are represented as a small '+' marker
    for i, position in enumerate(label_list):
        if position in random_list:
            plt.scatter(users_2d[i, 0], users_2d[i, 1], marker='+', c=[colors[position]])

    # Find the first index whose file does not exist yet,
    # so that earlier plots are never overwritten
    filename = "name"
    i = 0
    while os.path.isfile(filename + str(i) + ".png"):
        i = i + 1
    plt.savefig(filename + str(i) + ".png")
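
Since `matplotlib.mlab.PCA` is not available in recent Matplotlib releases, one possible substitute for the projection step is a small PCA written with NumPy's SVD. This is only a sketch, not the code used above: the helper names `fit_pca_2d` and `project_2d` are hypothetical, the demo data is random, and the helper only centers the data without rescaling it, so the result may differ from what mlabPCA produces.

```python
import numpy as np


def fit_pca_2d(data):
    """Return the column means and the first two principal axes of data."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:2]


def project_2d(points, mean, components):
    """Project n-dimensional points onto the two fitted principal axes."""
    return (np.asarray(points, dtype=float) - mean) @ components.T


# Hypothetical demo data: 200 six-dimensional points and two fake centroids
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(200, 6))
demo_centroids = np.vstack([demo_data[:100].mean(axis=0),
                            demo_data[100:].mean(axis=0)])

mean, components = fit_pca_2d(demo_data)
users_2d = project_2d(demo_data, mean, components)
centroids_2d = project_2d(demo_centroids, mean, components)
print(users_2d.shape, centroids_2d.shape)
```

The important detail is that the centroids are projected with the mean and axes fitted on the data points, so both end up in the same 2D coordinate system. The two resulting arrays play the same role as users_2d and centroids_2d in plot_cluster.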