Within-cluster similarity with KMeans

Posted 2021-01-18 21:18:12


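I am trying to cluster two-dimensional user data with KMeans in scikit-learn (Python). Using the elbow method (the point beyond which adding more clusters no longer significantly reduces the sum of squared errors), I determined that the right number of clusters is 50.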

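After applying KMeans, I would like to understand how similar the data points within each cluster are. Since I have 50 clusters, is there a way to get a single number per cluster (for example, the within-cluster variance) that tells me how close together the data points in that cluster are? For instance, a value like 0.8 would mean the records in that cluster have high variance, while 0.2 would mean they are closely related.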

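So, to summarize: is there a way to get a single number that indicates how "good" each cluster in KMeans is? Goodness is arguably relative, but assume I am mostly interested in the within-cluster variance as the measure of how good a particular cluster is.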

1 answer

Here is a code example that uses the silhouette score, taken from https://plot.ly/scikit-learn/plot-kmeans-silhouette-analysis/

from __future__ import print_function

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    print(cluster_labels)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
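
The per-sample silhouette values can be turned into the single number per cluster that the question asks for by averaging them within each cluster. The sketch below is one possible way to do that and is not part of the linked example: the helper name per_cluster_quality is hypothetical, n_clusters=4 is chosen only to match the toy data, and n_init=10 is set explicitly as an assumption. It also reports the mean squared distance to the centroid as a rough stand-in for within-cluster variance. The silhouette mean lies between -1 and 1, with higher values indicating a tighter, better separated cluster. The mean squared distance is in the squared units of the data, so it is not on a 0 to 1 scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples


def per_cluster_quality(X, labels, centers):
    """Return {cluster_id: (size, mean_silhouette, mean_sq_dist_to_centroid)}.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    sil = silhouette_samples(X, labels)
    summary = {}
    for k in np.unique(labels):
        mask = labels == k
        # Squared Euclidean distance of each member point to its centroid.
        sq_dist = ((X[mask] - centers[k]) ** 2).sum(axis=1)
        summary[int(k)] = (int(mask.sum()),
                           float(sil[mask].mean()),
                           float(sq_dist.mean()))
    return summary


# Same toy data as in the example above.
X, _ = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1,
                  center_box=(-10.0, 10.0), shuffle=True, random_state=1)

# n_clusters=4 and n_init=10 are assumptions made for this sketch.
clusterer = KMeans(n_clusters=4, n_init=10, random_state=10)
labels = clusterer.fit_predict(X)
quality = per_cluster_quality(X, labels, clusterer.cluster_centers_)

for k, (size, mean_sil, mean_sq) in sorted(quality.items()):
    print(f"cluster {k}: n={size}, mean silhouette={mean_sil:.3f}, "
          f"mean squared distance to centroid={mean_sq:.3f}")
```

With 50 clusters, sorting this summary by mean silhouette (or by mean squared distance) is a quick way to spot the loosest clusters. A low silhouette mean can also reflect a close neighboring cluster and not only internal spread, so the two numbers are best read together.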
