New column with coordinates using geopy and pandas

Posted 2024-06-11 09:33:18


I have a DataFrame (df):

import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty

df


     city_name  state_name  county_name
0    WASHINGTON  DC  DIST OF COLUMBIA
1    WASHINGTON  DC  DIST OF COLUMBIA
2    WASHINGTON  DC  DIST OF COLUMBIA
3    WASHINGTON  DC  DIST OF COLUMBIA
4    WASHINGTON  DC  DIST OF COLUMBIA
5    WASHINGTON  DC  DIST OF COLUMBIA
6    WASHINGTON  DC  DIST OF COLUMBIA
7    WASHINGTON  DC  DIST OF COLUMBIA
8    WASHINGTON  DC  DIST OF COLUMBIA
9    WASHINGTON  DC  DIST OF COLUMBIA

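I would like to get the latitude and longitude coordinates for any one of the columns in the DataFrame above. The documentation (http://geopy.readthedocs.org/en/latest/#data) is straightforward when dealing with individual locations.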

>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York,     ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}

However, I want to apply the function to every row in df and create a new column. I have tried:

df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))

But I think I am missing something in my code, because I get the following:

    city_name   state_name  county_name coordinates
0    WASHINGTON  DC  DIST OF COLUMBIA    None
1    WASHINGTON  DC  DIST OF COLUMBIA    None
2    WASHINGTON  DC  DIST OF COLUMBIA    None
3    WASHINGTON  DC  DIST OF COLUMBIA    None
4    WASHINGTON  DC  DIST OF COLUMBIA    None
5    WASHINGTON  DC  DIST OF COLUMBIA    None
6    WASHINGTON  DC  DIST OF COLUMBIA    None
7    WASHINGTON  DC  DIST OF COLUMBIA    None
8    WASHINGTON  DC  DIST OF COLUMBIA    None
9    WASHINGTON  DC  DIST OF COLUMBIA    None

I would like to use a lambda function to get a result like this:

     city_name  state_name  county_name  city_coord
0    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
1    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
2    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
3    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
4    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
5    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
6    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
7    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
8    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
9    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456
10   GLYNCO      GA  GLYNN               31.2224512, -81.5101023

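I would appreciate any help. Once I have the coordinates, I want to plot them. Any recommended resources for mapping coordinates would also be much appreciated. Thanks.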


2 Answers

You can call apply and pass the function you want to run on each value, like this:

In [9]:

geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
    city_name state_name       county_name  \
0  WASHINGTON         DC  DIST OF COLUMBIA   
1  WASHINGTON         DC  DIST OF COLUMBIA   

                                          city_coord  
0  (District of Columbia, United States of Americ...  
1  (District of Columbia, United States of Americ...  

You can then access the latitude and longitude attributes:

In [16]:

df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

Or do it in one line by calling apply twice:

In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df

Out[17]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
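One caveat with the lambda above: geocode returns None when it finds no match, so x.latitude would raise an AttributeError on such rows. Below is a minimal offline sketch of a guard. Note that fake_geocode, Location, and to_coords are hypothetical stand-ins (not part of geopy) so the example runs without network access, and the lat/lon column names are just an illustrative choice.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for geopy's Location object, so the sketch runs offline.
Location = namedtuple("Location", ["latitude", "longitude"])


def fake_geocode(query):
    """Hypothetical stand-in for geolocator.geocode; returns None if unknown."""
    known = {"DC": Location(38.8937154, -76.9877934586326)}
    return known.get(query)


def to_coords(location):
    """Return (lat, lon), or (None, None) when the lookup found nothing."""
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


df = pd.DataFrame({
    "city_name": ["WASHINGTON", "WASHINGTON", "NOWHERE"],
    "state_name": ["DC", "DC", "ZZ"],
})

coords = df["state_name"].apply(fake_geocode).apply(to_coords)

# Two separate numeric columns are usually handier for plotting than tuples.
df["lat"] = [c[0] for c in coords]
df["lon"] = [c[1] for c in coords]
print(df)
```

With a real geolocator, fake_geocode would be replaced by geolocator.geocode; the guard in to_coords stays the same.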

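Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) did nothing useful: it passes a lambda to geocode instead of applying geocode to each row, which is why you ended up with a column full of None values.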
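For comparison, a row-wise lambda along the lines the question was aiming for might look like the sketch below. It assumes you want to geocode a "city, state" string built from each row. fake_geocode is a hypothetical offline stand-in that returns a (lat, lon) tuple directly; the real geolocator.geocode returns a location object whose latitude and longitude attributes you would read instead.

```python
import pandas as pd


def fake_geocode(query):
    """Hypothetical offline stand-in for geolocator.geocode."""
    known = {"WASHINGTON, DC": (38.8949549, -77.0366456)}
    return known.get(query)  # None when the query is unknown


df = pd.DataFrame({
    "city_name": ["WASHINGTON", "WASHINGTON"],
    "state_name": ["DC", "DC"],
    "county_name": ["DIST OF COLUMBIA", "DIST OF COLUMBIA"],
})

# axis=1 hands each row to the lambda; the lambda builds the query string
# and the geocoder is called once per row.
df["city_coord"] = df.apply(
    lambda row: fake_geocode(f"{row['city_name']}, {row['state_name']}"),
    axis=1,
)
print(df)
```

As far as I know, recent geopy releases also expect a user_agent argument when constructing Nominatim, so a bare Nominatim() may need updating on a current install.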

EDIT

@leb makes an interesting point here: if you have many duplicated values, it is more efficient to geocode each unique value once and then map the results back:

In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d

Out[38]:
{'DC': (38.8937154, -76.9877934586326)}

In [40]:    
df['city_coord'] = df['state_name'].map(d)
df

Out[40]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

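So the above uses unique to get all the unique values, builds a dict from them, and then calls map to perform the lookup and add the coords, which is more efficient than geocoding row by row.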

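Upvote and accept @EdChum's answer; I just want to add to it. His approach works fine, but from personal experience I would like to share a few points: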

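When geocoding, if you have many repeated city/state combinations, it is much faster to send only one of them for geocoding and then copy the result to the other rows: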

This helps a lot with large data, and can be done in two ways:

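  1. Going only by your data, where the rows look like exact duplicates, and only if it suits your needs, drop the extra rows and geocode just one of them. This can be done with drop_duplicates.
  2. If you want to keep all rows, groupby the city/state combination, geocode the first row of each group by calling head(1), then copy the result to the remaining rows.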
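The two options above can be combined in a rough sketch: geocode one row per distinct city/state pair, then merge the result back onto every row. This uses drop_duplicates plus merge rather than groupby/head(1), which is my substitution and not necessarily what the answer had in mind. fake_geocode and the calls list are hypothetical helpers so the example runs offline and shows how many lookups were made.

```python
import pandas as pd

calls = []  # records each query so the number of lookups is visible


def fake_geocode(query):
    """Hypothetical offline stand-in for geolocator.geocode."""
    calls.append(query)
    known = {
        "WASHINGTON, DC": (38.8949549, -77.0366456),
        "GLYNCO, GA": (31.2224512, -81.5101023),
    }
    return known.get(query)


df = pd.DataFrame({
    "city_name": ["WASHINGTON"] * 3 + ["GLYNCO"],
    "state_name": ["DC"] * 3 + ["GA"],
})

keys = ["city_name", "state_name"]

# One row per distinct city/state combination.
unique = df[keys].drop_duplicates().copy()
unique["city_coord"] = [
    fake_geocode(f"{city}, {state}")
    for city, state in zip(unique["city_name"], unique["state_name"])
]

# A left merge copies each result back onto every matching row.
df = df.merge(unique, on=keys, how="left")
print(df)
print(len(calls), "lookups for", len(df), "rows")
```

Here four rows need only two lookups; the saving grows with the amount of duplication in the data.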

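The reason is that every call to Nominatim incurs a small delay, even if you are querying the same city/state back to back. As your data grows, this small delay adds up to long response times and possible timeouts.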
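If timeouts do show up, one common workaround is to retry a failed lookup after a short pause. The sketch below is only an illustration under assumptions: geocode_with_retry and flaky_geocode are hypothetical names, and it catches Python's built-in TimeoutError, whereas with geopy you would likely catch one of its own exceptions from geopy.exc (for example GeocoderTimedOut).

```python
import time


def geocode_with_retry(geocode, query, attempts=3, delay=0.0):
    """Call geocode(query), retrying on TimeoutError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return geocode(query)
        except TimeoutError:
            if attempt == attempts:
                raise  # out of retries, let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer each time


# Hypothetical flaky geocoder used only to exercise the helper offline:
# it times out twice and succeeds on the third call.
state = {"calls": 0}


def flaky_geocode(query):
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return (38.8949549, -77.0366456)


result = geocode_with_retry(flaky_geocode, "WASHINGTON, DC")
print(result, "after", state["calls"], "calls")
```

Against a real service you would pass a non-zero delay; it is left at 0.0 here so the example finishes instantly.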

Again, this is all from personal experience. If it does not help you now, keep it in mind for later.
