How do I use the weights_init parameter in scikit-learn's Gaussian mixture clustering?

Published 2024-04-20 13:48:24


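Question up front: how can the weights_init parameter of sklearn.mixture.GaussianMixture (GMM) be used to initialize a GMM from the output of a K-Means run performed by a separate Python package?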

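Goals:

  1. Run K-Means clustering on a large dataset on a GPU cluster using the RAPIDS cuML library
  2. Use the output of goal 1 to initialize GaussianMixture

    Requirement: pairing the external K-Means algorithm with scikit-learn's GMM must produce the same behavior as the default GMM initialization method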

The default GMM setup looks like this:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=K, init_params='kmeans')

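Problem: after reading the documentation, inspecting the source code, and looking for other custom implementations, I am still unsure about my approach, specifically about how to use the weights_init input parameter. My proposed approach is: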

from cuml import KMeans
from sklearn.mixture import GaussianMixture

# KMeans performed on GPU cluster w/ CUML library:
km = KMeans(n_clusters=K)
km.fit_predict(data)
labels = km.labels_
centroids = km.cluster_centers_

# GMM performed on CPU w/ sklearn library:
gmm = GaussianMixture(n_components=K, means_init=centroids, weights_init=???)
labels = gmm.fit_predict(data)
centroids = gmm.means_

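I can think of several ways to determine weights_init, but I am after the method used in the default implementation. My intuition is that each weight is simply the fraction of samples in the dataset that belong to a given cluster, but I cannot find anything that confirms this. Thanks in advance for any help or clarification.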
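The intuition above (each weight is the fraction of samples assigned to a cluster) can be written down as a minimal sketch. This is not scikit-learn's own code; `cluster_fractions` is a hypothetical helper name, and the label array is made-up toy data standing in for `km.labels_`.

```python
import numpy as np


def cluster_fractions(labels, n_clusters):
    """Fraction of samples assigned to each cluster (hypothetical helper)."""
    labels = np.asarray(labels, dtype=np.int64)
    # Count how many samples carry each label; minlength keeps empty clusters.
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / labels.size


# Toy labels standing in for km.labels_ (assumed to be integers in [0, K)).
labels = np.array([0, 0, 1, 2, 2, 2, 1, 0])
weights = cluster_fractions(labels, n_clusters=3)
print(weights)  # one fraction per cluster; the fractions sum to 1
```

An array like this has the shape (n_components,) that weights_init expects, and its entries sum to 1.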


1 answer
User
#1 · posted 2024-04-20 13:48:24

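The code below uses cuML's KMeans to build the weights for sklearn's GaussianMixture in place of the default weights. The weights are built from the labels produced by the cuML KMeans model. The example uses a make_blobs dataset: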

import numpy as np
from cuml.cluster import KMeans as cuKMeans

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.mixture._gaussian_mixture import _estimate_gaussian_parameters

n_samples = 100
n_features = 2

n_clusters = 5
random_state = 0

data, labels = make_blobs(n_samples=n_samples,
                          n_features=n_features,
                          centers=n_clusters,
                          random_state=random_state,
                          cluster_std=0.1)
km = cuKMeans(n_clusters=n_clusters, n_init=1)
km.fit(data)
label = km.labels_
centroids = km.cluster_centers_

# calculate the weights
resp = np.zeros((n_samples, n_clusters))
resp[np.arange(n_samples), label] = 1

weights, _, _ = _estimate_gaussian_parameters(data, resp, reg_covar=1e-6, covariance_type='full')
weights /= n_samples
print("weights : ", weights)

gmm = GaussianMixture(n_components=n_clusters, means_init=centroids, weights_init=weights)
labels = gmm.fit_predict(data)
gmm_centroids = gmm.means_
print(" gmm_centroids values with cuml weights : ")
print(gmm_centroids)

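# default GMM: sklearn's own K-Means initialization, no cuML input
# (component order may differ from the cuML-initialized model)

default_gmm = GaussianMixture(n_components=n_clusters, init_params='kmeans', random_state=random_state)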
labels = default_gmm.fit_predict(data)
default_gmm_centroids = default_gmm.means_
print("gmm_centroids values with default weights : ")
print(default_gmm_centroids)
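The answer relies on `_estimate_gaussian_parameters`, a private scikit-learn helper that may change between releases. As a hedged alternative, the sketch below computes the same kind of hard-assignment statistics with plain NumPy: weights, means, and full-covariance precisions. `hard_assignment_params` is a hypothetical name. The sketch assumes integer labels in [0, K), at least one sample per cluster, and covariance_type='full'. It is meant to approximate the default initialization, not reproduce it bit for bit, and the demo data is synthetic.

```python
import numpy as np


def hard_assignment_params(data, labels, n_clusters, reg_covar=1e-6):
    """Weights, means and full precisions from hard labels (hypothetical helper).

    Assumes every cluster has at least one sample.
    """
    data = np.asarray(data, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    n_samples, n_features = data.shape

    # Weight of a cluster = fraction of samples assigned to it.
    weights = np.bincount(labels, minlength=n_clusters) / n_samples

    means = np.zeros((n_clusters, n_features))
    precisions = np.zeros((n_clusters, n_features, n_features))
    for k in range(n_clusters):
        members = data[labels == k]
        means[k] = members.mean(axis=0)
        diff = members - means[k]
        # Maximum-likelihood covariance of the cluster, regularized on the diagonal.
        cov = diff.T @ diff / len(members)
        cov += reg_covar * np.eye(n_features)
        precisions[k] = np.linalg.inv(cov)
    return weights, means, precisions


# Synthetic two-blob data standing in for the real dataset and K-Means labels.
rng = np.random.default_rng(0)
demo_data = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                       rng.normal(5.0, 0.1, size=(30, 2))])
demo_labels = np.repeat([0, 1], [20, 30])
weights, means, precisions = hard_assignment_params(demo_data, demo_labels, n_clusters=2)
print(weights)
```

The three arrays have the shapes GaussianMixture expects for weights_init, means_init and precisions_init with covariance_type='full'. Whether supplying all three makes scikit-learn skip its internal K-Means run depends on the installed version, so that is worth checking in the source before relying on it.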
