带有pandas数据框的矢量化Haversine公式
我知道要计算两个经纬度点之间的距离,需要使用haversine函数:
def haversine(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
我有一个数据表,其中一列是纬度,另一列是经度。我想知道这些点距离一个固定点(-56.7213600, 37.2175900)有多远。那我该怎么把数据表里的值放进这个函数里呢?
示例数据表:
SEAZ LAT LON
1 296.40, 58.7312210, 28.3774110
2 274.72, 56.8148320, 31.2923240
3 192.25, 52.0649880, 35.8018640
4 34.34, 68.8188750, 67.1933670
5 271.05, 56.6699880, 31.6880620
6 131.88, 48.5546220, 49.7827730
7 350.71, 64.7742720, 31.3953780
8 214.44, 53.5192920, 33.8458560
9 1.46, 67.9433740, 38.4842520
10 273.55, 53.3437310, 4.4716664
1 个回答
35
我不能确认这些计算是否正确,但下面的代码是有效的:
In [11]:
from numpy import cos, sin, arcsin, sqrt
from math import radians
def haversine(row):
lon1 = -56.7213600
lat1 = 37.2175900
lon2 = row['LON']
lat2 = row['LAT']
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * arcsin(sqrt(a))
km = 6367 * c
return km
df['distance'] = df.apply(lambda row: haversine(row), axis=1)
df
Out[11]:
SEAZ LAT LON distance
index
1 296.40 58.731221 28.377411 6275.791920
2 274.72 56.814832 31.292324 6509.727368
3 192.25 52.064988 35.801864 6990.144378
4 34.34 68.818875 67.193367 7357.221846
5 271.05 56.669988 31.688062 6538.047542
6 131.88 48.554622 49.782773 8036.968198
7 350.71 64.774272 31.395378 6229.733699
8 214.44 53.519292 33.845856 6801.670843
9 1.46 67.943374 38.484252 6418.754323
10 273.55 53.343731 4.471666 4935.394528
这段代码在处理这么小的数据表时其实比较慢,不过我把它应用到了一个有10万行的数据表上:
In [35]:
%%timeit
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))
1 loops, best of 3: 17.2 ms per loop
相比之下,使用apply函数花了4.3秒,所以这段代码快了将近250倍,这点在未来要记住。
如果我们把上面的内容压缩成一行代码:
In [39]:
%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop
我们观察到速度进一步提升,现在快了大约341倍。