Matching Python datasets based on latitude and longitude

Published 2024-05-19 01:35:43


I have two datasets, and each one has longitude and latitude values.

Let's say:

  • point x1 is (lang_1, latt_1)
  • point x2 is (lang_2, latt_2)
  • the first dataset has "n" rows of data with point_x1, x1
  • the second dataset has "m" rows of data with point_x2, x2

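where m > n.

Edit: note that m will be 20000 or more, and n will be 5000 or more.

I want to group or merge the two datasets.

For each point in dataset 2, I want to find the closest point in dataset 1, and then create a new row of data for every row in dataset 2: point_x2, x2, x1 (where point x1 is the closest point to point x2).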

Dataset 1 sample (longitude, latitude, label):

-91.850532 40.376043 x1_a1
-91.850519 40.376043 x1_a2
-91.850504 40.376043 x1_a3
-91.850487 40.376043 x1_a4
-91.850399 40.376044 x1_a5
-91.850353 40.376044 x1_a6

Dataset 2 sample:

^{pr2}$

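I don't know much about data science or geospatial analysis, so I'm looking for help with the approach.

Please suggest how I should do this.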


2 Answers

I wrote some sample code. You can try it like this:

from math import radians, cos, sin, asin, sqrt
import pandas as pd

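def geo_distance(lng1, lat1, lng2, lat2):
    """Great-circle (haversine) distance in meters between two points given in degrees."""
    lng1, lat1, lng2, lat2 = map(radians, [lng1, lat1, lng2, lat2])
    dlon = lng2 - lng1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    # 6371 km is the mean Earth radius; multiply by 1000 to get meters
    dis = 2 * asin(sqrt(a)) * 6371 * 1000
    return dis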



df1 = pd.DataFrame({'lang_1':[-91.850532,-91.850519,-91.850504,-91.850487,-91.850399,-91.850353],
                    'latt_1':[40.376043,40.376043,40.376043,40.376043,40.376044,40.376044],
                    'x1':['x1_a1','x1_a2','x1_a3','x1_a4','x1_a5','x1_a6']})
df2 = pd.DataFrame({'lang_2':[-91.848442,-91.850292,-91.849919,-91.849109,-91.845884,-91.847344,-91.846937,-91.849827,-91.850149,-91.848569,-91.849063,-91.845563],
                    'latt_2':[40.380573,40.378533,40.377883,40.385833,40.381623,40.376693,40.382653,40.381343,40.383474,40.384904,40.377384,40.378604],
                    'x2':['x2_a0','x2_a1','x2_a2','x2_a3','x2_a4','x2_a5','x2_a6','x2_a7','x2_a8','x2_a9','x2_a10','x2_a11']})

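# A constant key on both sides turns the merge into a Cartesian product (every x2 paired with every x1)
df1['key'] = 0
df2['key'] = 0

df_cartesian = df2.merge(df1, on='key', how='outer')
df_cartesian['geo_distance'] = df_cartesian.apply(lambda row: geo_distance(row['lang_1'], row['latt_1'], row['lang_2'], row['latt_2']), axis=1)
# Sort by distance, then keep the first (closest) x1 for each x2 point
df_cartesian_min_distance = df_cartesian.sort_values(by="geo_distance").groupby(["lang_2", "latt_2", "x2"], as_index=False).first()
# .ix was removed from pandas; .loc does the same label-based selection here
print(df_cartesian_min_distance.loc[:, ["lang_2", "latt_2", "x2", "x1"]])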
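With m around 20000 and n around 5000, the Cartesian merge above builds roughly 100 million rows and calls `geo_distance` once per row, which may be slow and memory-hungry. The same haversine formula can be evaluated with NumPy broadcasting instead, one chunk of dataset 2 at a time. The following is only a sketch under assumptions: the helper name `nearest_by_haversine` and the `chunk_size` parameter are made up for illustration, and the column names follow the sample frames from this answer.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371 * 1000  # same mean Earth radius as geo_distance, in meters

def nearest_by_haversine(df_from, df_to, chunk_size=1000):
    """Hypothetical helper: for each row of df_from (lang_2, latt_2),
    return the x1 label of the nearest row of df_to (lang_1, latt_1)."""
    lng_to = np.radians(df_to['lang_1'].to_numpy())
    lat_to = np.radians(df_to['latt_1'].to_numpy())
    labels_to = df_to['x1'].to_numpy()
    nearest = []
    for start in range(0, len(df_from), chunk_size):
        chunk = df_from.iloc[start:start + chunk_size]
        # Column vectors so that broadcasting yields a (chunk, n) matrix
        lng = np.radians(chunk['lang_2'].to_numpy())[:, None]
        lat = np.radians(chunk['latt_2'].to_numpy())[:, None]
        a = (np.sin((lat_to - lat) / 2) ** 2
             + np.cos(lat) * np.cos(lat_to) * np.sin((lng_to - lng) / 2) ** 2)
        dist = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
        # argmin along axis 1 picks the closest df_to row for each chunk row
        nearest.extend(labels_to[dist.argmin(axis=1)].tolist())
    return nearest

df1 = pd.DataFrame({'lang_1': [-91.850532, -91.850519, -91.850504, -91.850487, -91.850399, -91.850353],
                    'latt_1': [40.376043, 40.376043, 40.376043, 40.376043, 40.376044, 40.376044],
                    'x1': ['x1_a1', 'x1_a2', 'x1_a3', 'x1_a4', 'x1_a5', 'x1_a6']})
df2 = pd.DataFrame({'lang_2': [-91.848442, -91.850292, -91.849919],
                    'latt_2': [40.380573, 40.378533, 40.377883],
                    'x2': ['x2_a0', 'x2_a1', 'x2_a2']})

# chunk_size=2 only to exercise the chunking on this tiny sample
df2['x1'] = nearest_by_haversine(df2, df1, chunk_size=2)
print(df2[['lang_2', 'latt_2', 'x2', 'x1']])
```

Each chunk holds a `chunk_size` × n distance matrix in memory, so the chunk size trades memory against the number of loop iterations. For much larger inputs a spatial index would probably be a better fit than brute force.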

I'm not sure whether it helps, but I came up with a more compact version than William's:

import pandas

dataset1 = pandas.DataFrame(data={'x':(-91.850532, -91.850519, -91.850504, -91.850487, -91.850399, -91.850353),
                                  'y':(40.376043, 40.376043, 40.376043, 40.376043, 40.376044, 40.376044)},
                            index=('x1_a1', 'x1_a2', 'x1_a3', 'x1_a4', 'x1_a5', 'x1_a6'))


dataset2 = pandas.DataFrame(data={'x':(-91.848442, -91.850292, -91.849919, -91.849109, -91.845884, -91.847344, -91.846937, -91.849827, -91.850149, -91.848569, -91.849063, -91.845563),
                                  'y':(40.380573, 40.378533, 40.377883, 40.385833, 40.381623, 40.376693, 40.382653, 40.381343, 40.383474, 40.384904, 40.377384, 40.378604)},
                            index=('x2_a0', 'x2_a1', 'x2_a2', 'x2_a3', 'x2_a4', 'x2_a5', 'x2_a6', 'x2_a7', 'x2_a8', 'x2_a9', 'x2_a10', 'x2_a11'))

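closest_points = {}
for name, point in dataset1.iterrows():
    # Euclidean distance from this dataset1 point to every dataset2 point
    distances = (((dataset2 - point) ** 2).sum(axis=1)**.5)
    # After sorting, the first index label is the nearest dataset2 point
    closest_points[name] = distances.sort_values().index[0]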

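It uses plain Euclidean distance between the two sets of points and, for each point in dataset1, gets the name of the closest point in dataset2. I believe you can easily adapt it to your needs from here.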
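The question asks for the opposite direction: for each row of dataset 2, the nearest point of dataset 1. One possible adaptation of the loop is sketched below. It iterates over `dataset2` and attaches the result as a new column. The names `closest_x1` and `result` are made up for illustration, and only a few of the sample points are used. Note that Euclidean distance on raw degrees is only an approximation, because away from the equator a degree of longitude is shorter than a degree of latitude.

```python
import pandas

dataset1 = pandas.DataFrame(
    data={'x': (-91.850532, -91.850519, -91.850353),
          'y': (40.376043, 40.376043, 40.376044)},
    index=('x1_a1', 'x1_a2', 'x1_a6'))

dataset2 = pandas.DataFrame(
    data={'x': (-91.848442, -91.850292),
          'y': (40.380573, 40.378533)},
    index=('x2_a0', 'x2_a1'))

closest_x1 = {}
for name, point in dataset2.iterrows():
    # Euclidean distance (in degrees) from this dataset2 point to every dataset1 point
    distances = ((dataset1 - point) ** 2).sum(axis=1) ** .5
    # idxmin returns the index label of the smallest distance
    closest_x1[name] = distances.idxmin()

# One row per dataset2 point, with the nearest x1 label alongside its coordinates
result = dataset2.assign(x1=pandas.Series(closest_x1))
print(result)
```

The loop still computes m × n distances, so for the sizes mentioned in the question it may be worth vectorizing it or using a spatial index.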
