KMeans(): getting cluster centroid labels and a reference back to the dataset

scikit-learn: KMeans and PCA dimensionality reduction

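I have a dataset of about 2 million rows and 7 columns, containing different measurements of household power consumption, with a date for each measurement.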

Date
Global_active_power
Global_reactive_power
Voltage
Global_intensity
Sub_metering_1
Sub_metering_2
Sub_metering_3

I load the dataset into a pandas DataFrame, select every column except the date column, and then perform a cross-validation split.

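import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

data = pd.read_csv('household_power_consumption.txt', delimiter=';')
# columns 2..8 are the seven measurement columns (Date and Time are skipped)
power_consumption = data.iloc[0:, 2:9].dropna()
pc_toarray = power_consumption.values
# keep 1% of the rows for fitting
hpc_fit, hpc_fit1 = train_test_split(pc_toarray, train_size=.01)
power_consumption.head()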

I then run K-means classification, using PCA dimensionality reduction for the display.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# reduce the sampled rows to two components, then cluster them
hpc = PCA(n_components=2).fit_transform(hpc_fit)
k_means = KMeans()
k_means.fit(hpc)

# grid covering the plotted region
x_min, x_max = hpc[:, 0].min() - 5, hpc[:, 0].max() - 1
y_min, y_max = hpc[:, 1].min(), hpc[:, 1].max() + 5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))

# predicted cluster for every grid point, used as the background colour
Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(hpc[:, 0], hpc[:, 1], 'k.', markersize=4)

# mark the centroids with white crosses
centroids = k_means.cluster_centers_
inert = k_means.inertia_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=8)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

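Now I would like to find out which rows fall under a given class, and which dates fall under a given class.

Is there any way to relate the points on the plot back to an index in my dataset after PCA?

Is there some method I don't know about?

Or is my approach fundamentally flawed?

Any recommendations?

I am fairly new to this field and am trying to read through a lot of code; this is a compilation of several examples I have seen.

My goal is to classify the data and then get the dates that fall under a class.

Thank you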

1 answer

User

#1 · posted 2024-05-16 00:24:22

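KMeans().predict(X) (docs here)

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.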

Parameters: (New data to predict)

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Returns: (Index of the cluster each sample belongs to)  

labels : array, shape [n_samples,]
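To make the description above concrete, here is a minimal sketch on synthetic data (the two blobs and the n_clusters, n_init, and random_state values are assumptions for illustration, not taken from the question). It suggests that predict returns one label per input row, in input order, and that each label is the index of the nearest row of cluster_centers_.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the measurement columns (hypothetical data):
# two well-separated blobs of 50 rows each, 3 features per row.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 3)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 3)),
])

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One label per input row, in the same order as the rows of X.
labels = k_means.predict(X)

# The same labels computed by hand: index of the closest centroid ("code").
distances = np.linalg.norm(
    X[:, None, :] - k_means.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)
```

Because the label array follows the row order of X, labels[i] always describes row i of whatever array was passed in.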

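My issue with the code you submitted is train_test_split(). It returns two arrays of randomly chosen rows from your dataset, which destroys the original row order. Because you pass it a bare NumPy array (power_consumption.values), the shuffled rows no longer carry any index, so it becomes hard to relate the labels returned by KMeans back to the sequential dates in your dataset.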
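If a random sample is still wanted, one possible workaround is to split the DataFrame itself rather than its .values array, so that every sampled row keeps its index entry. The sketch below uses a small synthetic frame with a made-up timestamp index; the frame contents, n_clusters=3, and the random_state values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the power-consumption frame, indexed by timestamp.
rng = np.random.default_rng(0)
index = pd.date_range("2007-01-01", periods=200, freq="min")
df = pd.DataFrame(
    rng.normal(size=(200, 3)),
    index=index,
    columns=["Voltage", "Global_intensity", "Sub_metering_1"],
)

# Splitting the DataFrame (not the bare array) keeps each row's timestamp.
train, rest = train_test_split(df, train_size=0.1, random_state=0)

k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train.values)

# labels_ follows the row order of `train`, so the shuffled index lines up.
train_labels = pd.Series(k_means.labels_, index=train.index, name="cluster")
```

With this layout, train_labels.sort_index() gives the sampled timestamps in chronological order together with their cluster.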

Here is an example that skips the split entirely:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

#read data into pandas dataframe
df = pd.read_csv('household_power_consumption.txt', delimiter=';')

[Image: raw dataset head]

#merge the date and time columns and convert them to datetime objects
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df.set_index(pd.DatetimeIndex(df['Datetime']), inplace=True)
df.drop(['Date','Time'], axis=1, inplace=True)

#put last column first
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.dropna()

[Image: preprocessed dates]

#convert the dataframe to a data array, leaving out the date column, which is not clustered
sliced = df.iloc[0:, 1:8].dropna()
hpc = sliced.values

k_means = KMeans()
k_means.fit(hpc)

# array of indexes corresponding to classes around centroids, in the order of your dataset
classified_data = k_means.labels_

#copy dataframe (may be memory intensive but just for illustration)
df_processed = df.copy()
df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)

[Image: finished result]

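Now you can see the cluster result matched to the dataset, in the column on the right-hand side.
Now that the data is classified, it is up to you to derive meaning from it.
This is just a complete start-to-finish example of how KMeans can be used.
To display your results, look at PCA, or make other plots that depend on the class.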
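As a rough sketch of that last step, and of the original question about relating PCA points back to dates: PCA's fit_transform returns one output row per input row in the same order, so the reduced coordinates can reuse the DataFrame's index. The data, the column names m1 to m4, and the parameter values below are hypothetical placeholders, not the household dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical frame: 120 daily rows, two well-separated groups of 60.
rng = np.random.default_rng(0)
index = pd.date_range("2007-01-01", periods=120, freq="D")
values = np.vstack([
    rng.normal(0.0, 0.3, size=(60, 4)),
    rng.normal(4.0, 0.3, size=(60, 4)),
])
df = pd.DataFrame(values, index=index, columns=["m1", "m2", "m3", "m4"])

# PCA keeps row order: row i of `reduced` corresponds to row i of `df`.
reduced = PCA(n_components=2).fit_transform(df.values)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# Reattach the datetime index to the reduced points and their labels.
result = pd.DataFrame(reduced, index=df.index, columns=["pc1", "pc2"])
result["cluster"] = labels

# Timestamps belonging to each cluster.
dates_by_cluster = {c: g.index for c, g in result.groupby("cluster")}
```

From here, a call such as plt.scatter(result["pc1"], result["pc2"], c=result["cluster"]) would colour each plotted point by its class, and dates_by_cluster[0] lists the dates that fall under class 0.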
