Index out of range when clustering text with TF-IDF in Spark MLlib KMeans

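from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel

# Cluster the TF-IDF vectors into 2 groups
clusters = KMeans.train(tfidf_vectors, 2, maxIterations=10)

# Distance from a point to its assigned cluster center
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = tfidf_vectors.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and reload the model
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")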

1 Answer

#1 · Posted on 2024-05-14 21:37:00

I ran into a similar problem today, and it looks like a bug. TF-IDF produces SparseVectors like this one:

>>> from pyspark.mllib.linalg import Vectors
>>> sv = Vectors.sparse(5, {1: 3})

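Accessing a value at an index greater than that of the last non-zero entry (for the vector above, any index past 1, such as sv[2]) raises an exception.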
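For comparison, the lookup behavior one would expect from a sparse vector can be sketched in plain Python. This is only an illustration: `sparse_get` and its arguments are hypothetical names, not part of PySpark, and the sketch assumes the vector is stored as a size plus sorted lists of indices and values. Any in-range index without a stored entry reads as 0.0, and only an index outside the declared size is an error.

```python
from bisect import bisect_left


def sparse_get(size, indices, values, i):
    """Return element i of a sparse vector given as (size, sorted indices, values).

    Hypothetical helper for illustration; not PySpark's implementation.
    """
    if not 0 <= i < size:
        raise IndexError("index %d out of range for size %d" % (i, size))
    # Binary search for i among the stored (non-zero) positions.
    pos = bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    # In range but not stored: the entry is an implicit zero.
    return 0.0


# Same shape as Vectors.sparse(5, {1: 3}): size 5, one non-zero at index 1.
size, indices, values = 5, [1], [3.0]
dense = [sparse_get(size, indices, values, i) for i in range(size)]
print(dense)  # [0.0, 3.0, 0.0, 0.0, 0.0]
```

Under this reading, indices 2 through 4 of the example vector are valid and simply hold zeros, which is why an exception there looks like a bug rather than intended behavior.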

A quick, though not very efficient, workaround is to convert the SparseVector to a NumPy array:

from math import sqrt

def error(point):
    center = clusters.centers[clusters.predict(point)]
    # Densify the SparseVector so every index can be read safely
    return sqrt(sum([x**2 for x in (point.toArray() - center)]))
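If densifying every point turns out to be too slow, the same Euclidean distance can in principle be computed from the non-zero entries alone, using ||p - c||^2 = ||c||^2 - 2 * sum(v_i * c[idx_i]) + sum(v_i^2). The sketch below is pure Python and is not part of the original answer: `sparse_distance` is a hypothetical name, the center values are made up, and it assumes the point is available as parallel lists of indices and values (PySpark's SparseVector exposes `indices` and `values` attributes) with the center as a dense sequence.

```python
from math import sqrt


def sparse_distance(indices, values, center, center_sq_norm=None):
    """Euclidean distance between a sparse point and a dense center.

    Hypothetical helper for illustration. Only the point's non-zero
    entries are visited; the center's squared norm can be precomputed
    once per center and passed in.
    """
    if center_sq_norm is None:
        center_sq_norm = sum(c * c for c in center)
    # Cross term and the point's own squared norm, over non-zeros only.
    dot = sum(v * center[i] for i, v in zip(indices, values))
    point_sq_norm = sum(v * v for v in values)
    # Clamp at zero to guard against tiny negative values from rounding.
    return sqrt(max(center_sq_norm - 2.0 * dot + point_sq_norm, 0.0))


# Point equivalent to Vectors.sparse(5, {1: 3}); the center is made up.
center = [1.0, 1.0, 0.0, 2.0, 0.0]
dist = sparse_distance([1], [3.0], center)

# Cross-check against the dense computation.
dense_point = [0.0, 3.0, 0.0, 0.0, 0.0]
expected = sqrt(sum((p - c) ** 2 for p, c in zip(dense_point, center)))
print(dist, expected)  # 3.0 3.0
```

The saving only materializes when the center's squared norm is computed once per center and reused; otherwise each call still walks the full dense center.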
