为数据集中的所有点查找距离最近的点

2024-04-25 02:25:00 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个如下的数据集

Id     Latitude      longitude
1      25.42         55.47
2      25.39         55.47
3      24.48         54.38
4      24.51         54.54

我想为数据集的每个点找到最近的距离。我在互联网上发现了以下距离函数

from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km

我正在使用以下函数

shortest_distance = []
for i in range(1,len(data)):
    distance1 = []
    for j in range(1,len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))

但这段代码对每个条目循环两次,并返回n^2次迭代,因此速度非常慢。我的数据集包含近100万条记录,每次在所有元素中循环两次都会变得非常昂贵

我想找到更好的方法来找出每行最近的点。有谁能帮我找到用python解决这个问题的方法吗

谢谢


Tags: 数据函数in距离datasincosdistance
3条回答

你可以使用字典来散列一些计算。代码多次计算A到B的距离(A和B是数据集中的两个任意点)

实现您自己的缓存:

from math import radians, cos, sin, asin, sqrt

dist_cache = {}
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    try:
        return dist_cache[(lon1, lat1, lon2, lat2)]
    except KeyError:
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a)) 
        km = 6367 * c
        dist_cache[(lon1, lat1, lon2, lat2)] = km
        return km

或使用lru_cache

from math import radians, cos, sin, asin, sqrt
from functools import lru_cache

@lru_cache
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km

你可以通过调用一个实现智能算法的库来实现very efficiently,一个例子是sklearn,它有一个^{}方法来实现这一点

为此修改的代码示例:

from sklearn.neighbors import NearestNeighbors
import numpy as np

def distance(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    lon1, lat1 = p1
    lon2, lat2 = p2
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

points = [[25.42, 55.47],
          [25.39, 55.47],
          [24.48, 54.38],
          [24.51, 54.54]]

nbrs = NearestNeighbors(n_neighbors=2, metric=distance).fit(points)

distances, indices = nbrs.kneighbors(points)

result = distances[:, 1]

>>> result
array([  1.889697  ,   1.889697  ,  17.88530556,  17.88530556])

找到离给定点最近的N点的蛮力方法是O(N),您必须检查每个点。 相反,如果N点存储在KD-tree中,则查找最近点平均是O(log(N))。 构建KD树还需要额外的一次性成本,这需要O(N)时间

如果需要重复此过程N次,则蛮力方法为O(N**2),kd树方法为O(N*log(N))。 因此,对于足够大的N,KD树将击败蛮力方法

有关最近邻算法(包括KD树)的更多信息,请参见here


下面(在函数using_kdtree中)是一种使用^{}计算最近邻的大圆弧长的方法

scipy.spatial.kdtree使用点之间的欧几里德距离,但有一个formula用于将球体上点之间的欧几里德弦距离转换为大圆弧长(给定球体半径)。 因此,我们的想法是将纬度/经度数据转换为笛卡尔坐标,使用KDTree查找最近的邻居,然后应用great circle distance formula获得所需的结果


以下是一些基准。使用N = 100using_kdtreeorig(蛮力)方法快39倍

In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop

In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop

In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop

对于N = 10000

In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop

In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop

In [7]: %timeit orig(data)
# untested; too slow

由于using_kdtreeO(N log(N)),而origO(N**2),因此 哪一个using_kdtreeorig快,将随着N的增长而增长 data,生长


import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])

def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a)) 
        km = R * c
        return km

    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j: continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i], 
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance


def using_sklearn(data):
    """
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km
    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result

np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 
                     'Longitude':np.random.uniform(0,360,size=N)})

expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)

相关问题 更多 >