用于地理位置d聚类的DBSCAN

2024-05-15 20:51:54 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个带经纬度对的数据帧。

这是我的数据框。

    order_lat  order_long
0   19.111841   72.910729
1   19.111342   72.908387
2   19.111342   72.908387
3   19.137815   72.914085
4   19.119677   72.905081
5   19.119677   72.905081
6   19.119677   72.905081
7   19.120217   72.907121
8   19.120217   72.907121
9   19.119677   72.905081
10  19.119677   72.905081
11  19.119677   72.905081
12  19.111860   72.911346
13  19.111860   72.911346
14  19.119677   72.905081
15  19.119677   72.905081
16  19.119677   72.905081
17  19.137815   72.914085
18  19.115380   72.909144
19  19.115380   72.909144
20  19.116168   72.909573
21  19.119677   72.905081
22  19.137815   72.914085
23  19.137815   72.914085
24  19.112955   72.910102
25  19.112955   72.910102
26  19.112955   72.910102
27  19.119677   72.905081
28  19.119677   72.905081
29  19.115380   72.909144
30  19.119677   72.905081
31  19.119677   72.905081
32  19.119677   72.905081
33  19.119677   72.905081
34  19.119677   72.905081
35  19.111860   72.911346
36  19.111841   72.910729
37  19.131674   72.918510
38  19.119677   72.905081
39  19.111860   72.911346
40  19.111860   72.911346
41  19.111841   72.910729
42  19.111841   72.910729
43  19.111841   72.910729
44  19.115380   72.909144
45  19.116625   72.909185
46  19.115671   72.908985
47  19.119677   72.905081
48  19.119677   72.905081
49  19.119677   72.905081
50  19.116183   72.909646
51  19.113827   72.893833
52  19.119677   72.905081
53  19.114100   72.894985
54  19.107491   72.901760
55  19.119677   72.905081

我想把这些最接近的点(200米的距离)聚在一起,下面是我的距离矩阵。

from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,
     1.05925366,  1.05420922],
   [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
     0.81742536,  0.98978355],
   [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
     0.81742536,  0.98978355],
   ..., 
   [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,
     1.02310118,  1.22871515],
   [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,
     0.        ,  1.39923529],
   [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,
     1.39923529,  0.        ]])

然后在距离矩阵上应用DBSCAN聚类算法。

 from sklearn.cluster import DBSCAN

 db = DBSCAN(eps=2,min_samples=5)
 y_db = db.fit_predict(distance_matrix)

我不知道如何选择eps&min_样本值。它将距离太远的点聚集在一个簇中(距离大约2公里),是因为它在聚集时计算欧几里德距离吗?请帮忙。


Tags: 数据fromimport距离dborder矩阵eps
3条回答

DBSCAN是的意思是将用于原始数据,具有用于加速的空间索引。我知道的唯一一个加速地理距离的工具是ELKI(Java)-scikit learn不幸的是,它只支持一些距离,比如欧几里德距离(参见sklearn.neighbors.NearestNeighbors)。 但很明显,你可以预先计算成对距离,所以这还不是一个问题。

但是,您没有仔细阅读文档,您认为DBSCAN使用距离矩阵是错误的:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)

在距离矩阵行上使用欧氏距离,这显然没有任何意义。

请参阅DBSCAN(添加了重点)的文档:

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)

metric : string, or callable

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

类似于fit_predict

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

换句话说,你需要

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")

我不知道您使用的是haversine的什么实现,但它看起来返回的结果是km,所以eps应该是0.2,而不是200 m的2

对于min_samples参数,这取决于预期的输出是什么。这里有几个例子。我的输出使用了基于this answerhaversine的实现,它给出了一个类似的距离矩阵,但与您的距离矩阵不同。

这是db = DBSCAN(eps=0.2, min_samples=5)

[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]

这将创建三个集群,0, 12,并且许多样本不属于至少有5个成员的集群,因此没有分配给集群(如-1)。

使用较小的min_samples值重试:

db = DBSCAN(eps=0.2, min_samples=2)

[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]

在这里,大多数样本都在至少一个其他样本200米的范围内,因此属于8个簇07中的一个。

编辑后添加

看起来“Anony Mousse”是对的,尽管我在结果中没有发现任何错误。为了贡献一些东西,下面是我用来查看集群的代码:

from math import radians, cos, sin, asin, sqrt

from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

import matplotlib.pyplot as plt
import pandas as pd


def haversine(lonlat1, lonlat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r


X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

X['cluster'] = y_db

plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()

您可以使用scikit learn的DBSCAN集群空间经纬度数据,而无需预先计算距离矩阵。

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))

这是关于clustering spatial data with scikit-learn DBSCAN的教程。特别要注意的是,eps值仍然是2km,但是它被6371除以,将其转换为弧度。另外,请注意.fit()采用haversine度量的弧度单位坐标。

相关问题 更多 >