I am performing clustering on a dataset using PySpark. To find the number of clusters, I ran the clustering over a range of values (2, 20) and computed the WSSE (within-cluster sum of squares) for each value of k. This is where I noticed something unusual. As far as I understand, the WSSE decreases monotonically as you increase the number of clusters. But the results I got say otherwise. I am showing the WSSE for only the first few clusters:
Results from Spark
For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071
If you look at the WSSE values for k=5 and k=6, you will see that the WSSE has increased. I turned to sklearn to see whether I would get similar results. The code I used for Spark and sklearn is in the appendix section towards the end of the post. I tried to use the same parameter values in the Spark and sklearn KMeans models. The results from sklearn were, as I expected, monotonically decreasing.
I do not know why the WSSE values from Spark increase. I tried using a different dataset and found similar behavior there as well. Am I going wrong somewhere? Any clues would be great.

The dataset is located here.
Reading the data and declaring variables
# get data
import pandas as pd
url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"
df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i !=target_col]
k_min = 2   # 2 is inclusive
k_max = 21  # 21 is exclusive; will fit up to k = 20
max_iter = 1000
seed = 42
Here is the code I used to get the sklearn results:
from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL
ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))
wsse_collect = []
for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
Here is the code I used to get the Spark results:
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans
standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_features = 'prediction'
assembler = VectorAssembler(inputCols=numeric_cols, outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)
scaler = StandardScaler(inputCol=standard_scaler_inpt_features, outputCol=kmeans_input_features, withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)
wsse_collect_spark = []
for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))
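As a side note, if you are on a newer Spark version, computeCost is deprecated (from Spark 3.0 onward, to the best of my knowledge) and the training WSSE is also exposed on the fitted model's summary (Spark 2.4+). A minimal sketch under that assumption:

# Alternative for newer Spark versions (assumes Spark >= 2.4): read the WSSE of the
# training data from the fitted model's summary instead of calling computeCost().
km_fit = km.fit(scaled_data)
wsse_spark = km_fit.summary.trainingCost  # within-cluster sum of squares on the training data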
UPDATE
Following @Michail N's answer, I changed the tol and maxIter values of the Spark KMeans model. Since Spark MLlib in fact implements K-means||, I also increased the number of initSteps by a factor of 50.
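For reference, this is roughly how the estimator inside the loop was configured; initSteps=100 (50 times the default of 2) is my reading of the description above, so treat the exact value as an assumption:

km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_features,
            k=i, maxIter=max_iter, seed=seed,
            initSteps=100)  # default is 2; increased by a factor of 50

Re-running the loop with this estimator gave the following results: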
For k = 002 WSSE is 255318.718684
For k = 003 WSSE is 212364.906298
For k = 004 WSSE is 185999.709027
For k = 005 WSSE is 168616.028321
For k = 006 WSSE is 123879.449228
For k = 007 WSSE is 113646.930680
For k = 008 WSSE is 102803.889178
For k = 009 WSSE is 97819.497501
For k = 010 WSSE is 99973.198132
For k = 011 WSSE is 89103.510831
For k = 012 WSSE is 84462.110744
For k = 013 WSSE is 78803.619605
For k = 014 WSSE is 82174.640611
For k = 015 WSSE is 79157.287447
For k = 016 WSSE is 75007.269644
For k = 017 WSSE is 71610.292172
For k = 018 WSSE is 68706.739299
For k = 019 WSSE is 65440.906151
For k = 020 WSSE is 66396.106118
The jump from k=5 to k=6 is gone, but the behavior still shows up elsewhere, for example from k=13 to k=14, though at least now I know where it is coming from.
There is nothing wrong with the WSSE not decreasing monotonically. In theory the WSSE must decrease monotonically if the clustering is optimal, meaning the clustering with the best WSSE out of all possible clusterings with k centers.

The problem is that K-means is not necessarily able to find the optimal clustering for a given k. Its iterative process can converge from a random starting point to a local minimum, which may be good but is not optimal.

There are methods such as K-means++ and K-means|| whose initialization algorithms are more likely to pick diverse, well-separated centroids and therefore lead more reliably to a good clustering, and Spark MLlib does in fact implement K-means||. However, all of them still have an element of randomness in the selection and cannot guarantee an optimal clustering.

The random starting set of centroids chosen for k=6 perhaps led to a particularly suboptimal clustering, or it may have stopped early before reaching its local optimum.
You can improve this by changing the parameters of KMeans manually. The algorithm has a threshold via tol that controls the minimum amount of centroid movement considered significant; a lower value means the K-means algorithm will let the centroids keep moving for longer.

Increasing the maximum number of iterations with maxIter also keeps it from stopping too early, at the possible cost of more computation.
So my advice is to re-run the clustering with a larger maxIter and a smaller tol.
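A minimal sketch of what such a re-run could look like for the problematic k=6, assuming a tighter tol, more iterations, and a few different seeds to keep the best local optimum (the exact values are illustrative, not taken from the answer):

best_wsse = float('inf')
for s in (42, 43, 44):  # try a few seeds and keep the best local optimum
    km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_features,
                k=6, maxIter=1000, tol=1e-8, initSteps=10, seed=s)
    model = km.fit(scaled_data)
    best_wsse = min(best_wsse, model.computeCost(scaled_data))
print('Best WSSE for k = 006 over seeds: {0:10f}'.format(best_wsse))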