Scipy.cluster.hierarchy.fclusterdata + distance measure
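1) I am using scipy's hcluster module.
The variable I can control is the threshold. How do I know how well the clustering performs at each threshold? In K-means, for example, performance is the sum of the distances from all points to their centroids. That figure has to be adjusted, of course, because more clusters generally means smaller distances.
Is there something I can observe with hcluster that tells me this?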
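2) I see that fclusterdata offers many metrics. I am clustering text documents based on the tf-idf of key terms. The catch is that some documents are longer than others, and I think cosine similarity is a good way to "normalize" this length issue: if a document stays consistent in content as it gets longer, its "direction" in n-dimensional space should stay the same. Are there other methods you would suggest? How can I evaluate them?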
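The direction argument can be checked on a toy example. The sketch below uses a made-up `docs` matrix (hypothetical term weights, not real tf-idf output) in which document 1 is document 0 repeated three times. It compares `pdist` with `metric="cosine"` and `metric="euclidean"`, then passes the same metric to `fclusterdata`. The threshold `t=0.5` and `method="average"` are arbitrary choices for illustration only.

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist

# Hypothetical term weights: doc 1 is doc 0 repeated three times (same
# direction, greater length); docs 2 and 3 use mostly different terms.
docs = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 6.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.1, 2.0, 2.0],
])

# Full pairwise distance matrices under the two metrics.
cosine = dist.squareform(dist.pdist(docs, metric="cosine"))
euclid = dist.squareform(dist.pdist(docs, metric="euclidean"))
print("cosine    d(doc0, doc1) = %.3g" % cosine[0, 1])
print("euclidean d(doc0, doc1) = %.3g" % euclid[0, 1])

# Flat clusters cut at cophenetic distance 0.5 on the cosine dendrogram.
labels = hier.fclusterdata(docs, t=0.5, criterion="distance",
                           metric="cosine", method="average")
print("flat clusters: %s" % labels.tolist())
```

Under the cosine metric the short and the long version of the same document are essentially at distance zero, while the Euclidean metric separates them purely because of length.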
Thanks
1 Answer
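You can compute the average distance |x - cluster centre| over the points x in each cluster, just as for K-means. The code below does this by brute force. (It ought to be built into scipy.cluster or scipy.spatial.distance, but I can't find it.)
On your question 2, I'll pass. Links to good tutorials on hierarchical clustering would be welcome.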
#!/usr/bin/env python
""" cluster cities: pdist linkage fcluster plot
util: clusters() avdist()
"""
from __future__ import division
import sys
import numpy as np
import scipy.cluster.hierarchy as hier # $scipy/cluster/hierarchy.py
import scipy.spatial.distance as dist
import pylab as pl
from citiesin import citiesin # 1000 US cities
__date__ = "27may 2010 denis"

def clusterlists(T):
    """ T = hier.fcluster( Z, t ) e.g. [a b a b a c]
        -> [ [0 2 4] [1 3] [5] ] sorted by len
    """
    # fcluster labels start at 1, so slot 0 stays empty
    clists = [ [] for j in range( max(T) + 1 )]
    for j, c in enumerate(T):
        clists[c].append( j )
    clists.sort( key=len, reverse=True )
    return clists[:-1] # clip the []

def avdist( X, to=None ):
    """ av dist X vecs to "to", None: mean(X) """
    if to is None:
        to = np.mean( X, axis=0 )
    return np.mean( dist.cdist( X, [to] ))

#...............................................................................
Ndata = 100
method = "average"
t = 0
crit = "maxclust"
# 'maxclust': Finds a minimum threshold `r` so that the cophenetic distance
# between any two original observations in the same flat cluster
# is no more than `r` and no more than `t` flat clusters are formed.
# but t affects cluster sizes only weakly ?
# t 25: [10, 9, 8, 7, 6
# t 20: [12, 11, 10, 9, 7
plot = 0
seed = 1
# override the settings above from the command line, e.g. Ndata=200 t=30
exec( "\n".join( sys.argv[1:] )) # Ndata= t= ...
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True ) # .2f
me = __file__.split('/') [-1]
# biggest US cities --
cities = np.array( citiesin( n=Ndata )[0] ) # N,2
if t == 0: t = Ndata // 4
#...............................................................................
print( "# %s Ndata=%d t=%d method=%s crit=%s " % (me, Ndata, t, method, crit) )
Y = dist.pdist( cities ) # n*(n-1) / 2
Z = hier.linkage( Y, method ) # n-1
T = hier.fcluster( Z, t, criterion=crit ) # n
clusters = clusterlists(T)
print( "cluster sizes: %s" % [len(c) for c in clusters] )
print( "# average distance to centre in the biggest clusters:" )
for c in clusters:
    # skip clusters under a third the size of the biggest one
    if len(c) < len(clusters[0]) // 3: break
    cit = cities[c].T
    print( "%.2g %s" % (avdist(cit.T), cit) )
    if plot:
        pl.plot( cit[0], cit[1] )
if plot:
    pl.title( "scipy.cluster.hierarchy of %d US cities, %s t=%d" % (
        Ndata, crit, t) )
    pl.grid(False)
    if plot >= 2:
        pl.savefig( "cities-%d-%d.png" % (Ndata, t), dpi=80 )
    pl.show()
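
To see how this measure moves with the threshold, which is what question 1 asks, one can cut the same linkage at several values of t and print the average distance to centre next to the number of clusters. The following is only a sketch under assumptions: it uses random points rather than the cities file, `criterion="distance"` rather than the `maxclust` used above, and the helper names `mean_dist_to_centre` and `sweep_thresholds` are made up for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist


def mean_dist_to_centre(X, labels):
    """Average over all points of |x - centre of the flat cluster containing x|."""
    total = 0.0
    for lab in np.unique(labels):
        members = X[labels == lab]
        centre = members.mean(axis=0)
        total += dist.cdist(members, [centre]).sum()
    return total / len(X)


def sweep_thresholds(X, thresholds, method="average"):
    """Return one (t, number of clusters, mean distance to centre) row per t."""
    Z = hier.linkage(dist.pdist(X), method)   # build the tree once
    rows = []
    for t in thresholds:
        labels = hier.fcluster(Z, t, criterion="distance")
        rows.append((t, len(np.unique(labels)), mean_dist_to_centre(X, labels)))
    return rows


rng = np.random.RandomState(1)     # made-up data, not the cities file
X = rng.uniform(size=(60, 2))
for t, nclusters, score in sweep_thresholds(X, [0.05, 0.1, 0.2, 0.4, 10.0]):
    print("t=%-5g clusters=%-3d mean dist to centre=%.3g" % (t, nclusters, score))
```

The score alone will favour small thresholds, since it reaches zero once every point is its own cluster, so it is more useful read against the cluster count than minimised on its own.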