Unsupervised learning - clustering numpy arrays within a numpy array

Posted 2024-04-25 16:48:30


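We are working with a speech dataset. The waveforms have been converted to MFCC values. Each row (one wav file) consists of roughly 20 to 40 arrays (depending on the length of the sound file), and each array holds 13 floats. The goal of the task is to recognize 10 spoken digits. Since we have no labels, we want to use an unsupervised learning method to split the recordings into 10 groups.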

The code looks like this:

import numpy as np
import matplotlib.pyplot as plt


def kmeans(data, k=3, normalize=False, limit=500):
    """Basic k-means clustering algorithm.
    """
    # optionally normalize the data. k-means will perform poorly or strangely if the dimensions
    # don't have the same ranges.
    if normalize:
        stats = (data.mean(axis=0), data.std(axis=0))
        data = (data - stats[0]) / stats[1]

    # pick the first k points to be the centers. this also ensures that each group has at least
    # one point.
    centers = data[:k]

    for i in range(limit):
        # core of clustering algorithm...
        # first, use broadcasting to calculate the distance from each point to each center, then
        # classify based on the minimum distance.
        classifications = np.argmin(((data[:, :, None] - centers.T[None, :, :])**2).sum(axis=1), axis=1)
        # next, calculate the new centers for each cluster.
        new_centers = np.array([data[classifications == j, :].mean(axis=0) for j in range(k)])

        # if the centers aren't moving anymore it is time to stop.
        if (new_centers == centers).all():
            break
        else:
            centers = new_centers
    else:
        # this will not execute if the for loop exits on a break.
        raise RuntimeError(f"Clustering algorithm did not complete within {limit} iterations")

    # if data was normalized, the cluster group centers are no longer scaled the same way the original
    # data is scaled.
    if normalize:
        centers = centers * stats[1] + stats[0]

    # i is zero-based, so the number of completed iterations is i + 1.
    print(f"Clustering completed after {i + 1} iterations")

    return classifications, centers


classifications, centers = kmeans(speechdata, k=5)
plt.figure(figsize=(12, 8))
plt.scatter(x=speechdata[:, 0], y=speechdata[:, 1], s=100, c=classifications)
plt.scatter(x=centers[:, 0], y=centers[:, 1], s=500, c='k', marker='^')

The line "classifications, centers = kmeans(speechdata, k=5)" raises an error: IndexError: too many indices for array.
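
The error can be reproduced on synthetic data. The sketch below assumes speechdata is a one-dimensional object array whose elements are (n_frames, 13) arrays; the random values and the name ragged are made up for illustration and are not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for speechdata: three recordings with 20, 38 and 25
# frames, each frame holding 13 MFCC coefficients.
frames = [rng.normal(size=(n, 13)) for n in (20, 38, 25)]
ragged = np.empty(len(frames), dtype=object)
for idx, frame_block in enumerate(frames):
    ragged[idx] = frame_block

# The outer array has a single axis; the (n_frames, 13) shapes live inside
# its elements and are invisible to NumPy indexing.
print(ragged.shape)  # (3,)

try:
    # Same indexing pattern that kmeans applies to its data argument.
    ragged[:, :, None]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```

Under that assumption, kmeans receives a 1-D array, so any expression that indexes it with three indices fails before clustering starts.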

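How can I convert this array of arrays, whose rows have different lengths (one row has shape (20, 13), another might have shape (38, 13)), so that I can cluster them?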
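
One possible approach, offered as a sketch and not as the definitive answer, is to summarize each recording with fixed-length statistics so that every recording becomes a single row. The helper name summarize_recordings and the choice of per-coefficient mean and standard deviation are assumptions made here; this representation discards the temporal order of the frames, which may matter for telling digits apart.

```python
import numpy as np


def summarize_recordings(recordings):
    """Map a sequence of (n_frames, 13) arrays to one (n_recordings, 26) array.

    Each recording is reduced to the mean and standard deviation of every
    coefficient across its frames, so the row length no longer depends on
    how many frames the recording has.
    """
    rows = []
    for rec in recordings:
        rec = np.asarray(rec, dtype=float)
        rows.append(np.concatenate([rec.mean(axis=0), rec.std(axis=0)]))
    return np.vstack(rows)


# Synthetic recordings with different frame counts (illustrative data only).
rng = np.random.default_rng(0)
demo = [rng.normal(size=(n, 13)) for n in (20, 38, 25)]

features = summarize_recordings(demo)
print(features.shape)  # (3, 26)
```

The resulting two-dimensional array could then be passed to kmeans in place of the ragged array, for example kmeans(features, k=10, normalize=True). Padding every recording to a common number of frames and flattening it is another option, at the cost of much longer rows.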

