Python cluster "purity" metric

import numpy as np
# GMM was removed in scikit-learn 0.20; GaussianMixture is its replacement
from sklearn.mixture import GaussianMixture

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GaussianMixture(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

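3 Answers

User

Answer 1 · edited 2024-05-16 23:36:02

sklearn does not implement a cluster purity metric. You have two options:

Implement the metric yourself using sklearn data structures. This and this have some Python source code for measuring purity, but either your data or the function bodies will need to be adapted so that they are compatible with each other.
Use the (much less mature) PML library, which does implement cluster purity.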
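
As a rough illustration of the first option, purity can be computed straight from its definition: for each cluster, count its most common true label, sum those counts, and divide by the number of samples. The function name `purity_from_counts` and the toy labels below are made up for this sketch and do not come from either of the linked sources.

```python
from collections import Counter

def purity_from_counts(y_true, y_pred):
    """Fraction of samples that carry the majority true label of their cluster."""
    # Map each predicted cluster to a Counter of the true labels it contains
    clusters = {}
    for true_label, cluster in zip(y_true, y_pred):
        clusters.setdefault(cluster, Counter())[true_label] += 1
    # Sum the size of the majority class in every cluster
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(y_true)

# Hypothetical toy data: 6 samples, 3 true classes, 2 predicted clusters
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]
print(purity_from_counts(y_true, y_pred))
```

Here cluster 0 contributes 2 samples and cluster 1 contributes 2, so the purity is 4/6.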

User

Answer 2 · edited 2024-05-16 23:36:02

David's answer works, but here is another way to do it.

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

Also, if you need to compute the inverse purity, just replace "axis=0" with "axis=1".
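
To make the difference between the two axes concrete, here is a small sketch on made-up labels. The `axis` parameter and the name `purity_along` are additions for illustration only, not part of the function above.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def purity_along(y_true, y_pred, axis=0):
    # Rows are true classes, columns are predicted clusters
    cm = contingency_matrix(y_true, y_pred)
    # axis=0: best class per cluster (purity)
    # axis=1: best cluster per class (inverse purity)
    return np.sum(np.amax(cm, axis=axis)) / np.sum(cm)

# Hypothetical toy data: 6 samples, 3 true classes, 2 predicted clusters
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

purity = purity_along(y_true, y_pred, axis=0)
inverse_purity = purity_along(y_true, y_pred, axis=1)
print(purity, inverse_purity)
```

The contingency matrix for this data is [[2, 1], [0, 2], [0, 1]], so the column maxima sum to 4 (purity 4/6) and the row maxima sum to 5 (inverse purity 5/6).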

User

Answer 3 · edited 2024-05-16 23:36:02

A late contribution.

You can try implementing it like this, as in this gist:

import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
        Args:
            y_true(np.ndarray): n*1 matrix Ground truth labels
            y_pred(np.ndarray): n*1 matrix Predicted clusters

        Returns:
            float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g with set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    # np.unique with return_inverse gives each label's index in the sorted
    # unique set, without overwriting the caller's array in place
    labels, inverse = np.unique(y_true, return_inverse=True)
    y_true = inverse.reshape(y_true.shape)
    # Update unique labels
    labels = np.arange(labels.shape[0])
    # We use n_classes+1 bin edges so that each class falls between
    # two consecutive edges, the upper edge being excluded:
    # [bin_i, bin_i+1)
    bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
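
The majority-vote idea behind this function can be checked by hand on a tiny example. The sketch below uses made-up labels and assumes they are already non-negative integers, so `np.bincount` can stand in for the histogram step; it is a simplified illustration, not the gist's code.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 true classes, 2 predicted clusters
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# Relabel every sample with the majority true label of its cluster
voted = np.empty_like(y_true)
for cluster in np.unique(y_pred):
    mask = y_pred == cluster
    # np.bincount counts occurrences of each non-negative integer label
    voted[mask] = np.argmax(np.bincount(y_true[mask]))

# Accuracy of the voted labels against the ground truth equals purity
accuracy = np.mean(voted == y_true)
print(voted, accuracy)
```

Cluster 0 votes for class 0 and cluster 1 votes for class 1, so four of the six samples match their voted label.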
