DBSCAN for clustering geographic location data

    order_lat  order_long
0   19.111841   72.910729
1   19.111342   72.908387
2   19.111342   72.908387
3   19.137815   72.914085
4   19.119677   72.905081
5   19.119677   72.905081
6   19.119677   72.905081
7   19.120217   72.907121
8   19.120217   72.907121
9   19.119677   72.905081
10  19.119677   72.905081
11  19.119677   72.905081
12  19.111860   72.911346
13  19.111860   72.911346
14  19.119677   72.905081
15  19.119677   72.905081
16  19.119677   72.905081
17  19.137815   72.914085
18  19.115380   72.909144
19  19.115380   72.909144
20  19.116168   72.909573
21  19.119677   72.905081
22  19.137815   72.914085
23  19.137815   72.914085
24  19.112955   72.910102
25  19.112955   72.910102
26  19.112955   72.910102
27  19.119677   72.905081
28  19.119677   72.905081
29  19.115380   72.909144
30  19.119677   72.905081
31  19.119677   72.905081
32  19.119677   72.905081
33  19.119677   72.905081
34  19.119677   72.905081
35  19.111860   72.911346
36  19.111841   72.910729
37  19.131674   72.918510
38  19.119677   72.905081
39  19.111860   72.911346
40  19.111860   72.911346
41  19.111841   72.910729
42  19.111841   72.910729
43  19.111841   72.910729
44  19.115380   72.909144
45  19.116625   72.909185
46  19.115671   72.908985
47  19.119677   72.905081
48  19.119677   72.905081
49  19.119677   72.905081
50  19.116183   72.909646
51  19.113827   72.893833
52  19.119677   72.905081
53  19.114100   72.894985
54  19.107491   72.901760
55  19.119677   72.905081

from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,  1.05925366,  1.05420922],
       [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,  0.81742536,  0.98978355],
       [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,  0.81742536,  0.98978355],
       ...,
       [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,  1.02310118,  1.22871515],
       [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,  0.        ,  1.39923529],
       [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,  1.39923529,  0.        ]])

3 answers

User

Answer 1 · edited 2024-05-15 20:51:54

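DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know of that accelerates geographic distances is ELKI (Java). scikit-learn unfortunately supports index acceleration only for a few distances such as Euclidean distance (see sklearn.neighbors.NearestNeighbors). But apparently you can afford to precompute pairwise distances, so this is not yet a problem.

However, you did not read the documentation carefully enough, and your assumption that DBSCAN takes a distance matrix by default is wrong: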

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)

This computes Euclidean distances between the rows of the distance matrix, which obviously makes no sense.
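To make the problem concrete, here is a minimal sketch in plain Python. The 3x3 distance matrix is made up for illustration and is not taken from the question's data. Without metric="precomputed", each row of the matrix is treated as a feature vector, so the value compared against eps is the Euclidean distance between two rows, not the matrix entry itself.

```python
import math

# Hypothetical 3x3 distance matrix in km (illustrative values only).
distance_matrix = [
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# What we intended to compare against eps: the matrix entries.
true_d01 = distance_matrix[0][1]
true_d02 = distance_matrix[0][2]

# What gets compared when the rows are treated as feature vectors.
row_d01 = euclidean(distance_matrix[0], distance_matrix[1])
row_d02 = euclidean(distance_matrix[0], distance_matrix[2])

print("intended:", true_d01, true_d02)
print("rows as features:", row_d01, row_d02)
```

In this toy case both row-based values differ from the intended distances, so an eps chosen in kilometres no longer means what it was supposed to mean.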

See the documentation of DBSCAN (emphasis added):

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

Similarly for fit_predict:

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.

In other words, you need

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
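A minimal sketch of this fix on a toy example, assuming scikit-learn and NumPy are installed. The 5x5 matrix D and the eps and min_samples values are hypothetical, chosen only so that the result is easy to check by hand; they are not the question's data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical symmetric distance matrix in km for 5 samples:
# samples 0-2 lie within 0.1 km of each other, samples 3 and 4 are far away.
D = np.array([
    [0.0, 0.1, 0.1, 5.0, 9.0],
    [0.1, 0.0, 0.1, 5.0, 9.0],
    [0.1, 0.1, 0.0, 5.0, 9.0],
    [5.0, 5.0, 5.0, 0.0, 4.0],
    [9.0, 9.0, 9.0, 4.0, 0.0],
])

# metric="precomputed" tells DBSCAN that D already holds pairwise distances,
# so eps is compared directly against the matrix entries.
db = DBSCAN(eps=0.2, min_samples=3, metric="precomputed")
labels = db.fit_predict(D)
print(labels)
```

The three close samples end up with label 0, and the two distant samples are labelled -1 (noise).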

User
Answer 2 · edited 2024-05-15 20:51:54

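I don't know which implementation of haversine you are using, but it looks like it returns results in km, so for 200 m eps should be 0.2, not 2.
As for the min_samples parameter, that depends on what output you expect. Here are a couple of examples. My outputs use an implementation of haversine based on this answer, which gives a distance matrix similar to, but not identical with, yours.
This is with db = DBSCAN(eps=0.2, min_samples=5)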
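As a quick check on the units, here is a small sketch using only the standard library. The function name haversine_km and the Earth radius of 6371 km are assumptions made for this illustration; the two points are rows 0 and 1 of the data in the question.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Rows 0 and 1 of the question's data, as (lat, lon).
d_km = haversine_km(19.111841, 72.910729, 19.111342, 72.908387)
print(round(d_km, 4), "km, i.e. about", round(d_km * 1000), "m")
```

The result is roughly 0.25 km, in line with the 0.2522482 entry in the question's matrix, which supports reading that matrix in kilometres.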
[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]
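This creates three clusters, 0, 1 and 2. Many of the samples do not fall into a cluster with at least 5 members and so are not assigned to any cluster (shown as -1).
Trying again with a smaller min_samples value:
db = DBSCAN(eps=0.2, min_samples=2)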
[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]
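Here, most of the samples are within 200 m of at least one other sample and so fall into one of 9 clusters, 0 through 8.
Edited to add
It looks like "Anony Mousse" is right, although I didn't see anything wrong in my results. For the sake of contributing something, here is the code I used to look at the clusters: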
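To summarise a label array like the one above, the clusters and noise points can be counted with a few lines of plain Python. This is only a sketch; the labels list is copied from the min_samples=2 output shown above, and -1 is the label DBSCAN uses for noise.

```python
from collections import Counter

# Labels copied from the min_samples=2 output above.
labels = [0, 1, 1, 2, 3, 3, 3, 4, 4, 3, 3, 3, 5, 5, 3, 3, 3, 2, 6, 6, 7, 3,
          2, 2, 8, 8, 8, 3, 3, 6, 3, 3, 3, 3, 3, 5, 0, -1, 3, 5, 5, 0, 0, 0,
          6, -1, -1, 3, 3, 3, 7, -1, 3, -1, -1, 3]

n_clusters = len(set(labels) - {-1})   # distinct labels, excluding noise
n_noise = labels.count(-1)             # samples not assigned to any cluster
sizes = Counter(l for l in labels if l != -1)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", dict(sorted(sizes.items())))
```

For this output the count is 9 distinct cluster labels (0 through 8) and 6 noise points out of 56 samples.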
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd

def haversine(lonlat1, lonlat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    # (each point is unpacked as (lat, lon), despite the parameter names)
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r

X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

X['cluster'] = y_db

plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()

User
Answer 3 · edited 2024-05-15 20:51:54

You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))

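This comes from a tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, note that the eps value is still 2 km, but it is divided by 6371 (the Earth's radius in km) to convert it to radians. Also note that .fit() expects the coordinates in radians for the haversine metric.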
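The unit conversion can be checked with a short standard-library sketch. The helper haversine_rad and the constant EARTH_RADIUS_KM are names introduced here for illustration; the helper mimics a haversine metric that works on (lat, lon) pairs in radians and returns an angle on the unit sphere. The two points are rows 0 and 3 of the question's data.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_rad(p, q):
    """Central angle in radians between two (lat, lon) points given in radians."""
    lat1, lon1 = p
    lat2, lon2 = q
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * asin(sqrt(a))

# eps expressed in km, then converted to radians for the haversine metric.
eps_km = 2.0
eps_rad = eps_km / EARTH_RADIUS_KM

# Rows 0 and 3 of the question's data, converted from degrees to radians.
p = tuple(map(radians, (19.111841, 72.910729)))
q = tuple(map(radians, (19.137815, 72.914085)))

d_rad = haversine_rad(p, q)
d_km = d_rad * EARTH_RADIUS_KM
print("distance in km:", round(d_km, 2))
print("within eps:", d_rad <= eps_rad)
```

Comparing d_rad with eps_rad gives the same answer as comparing d_km with eps_km, which is why dividing eps by the Earth's radius is enough to keep the threshold at 2 km.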
