I have a DataFrame of 600,000 x/y points with date and time information, another field "status", and additional descriptive information.

My goal is, for each record: find the other records that fall within a specific buffer of it, namely within 8 hours of its time t and < 100 metres of its position, and summarise their "status".
Currently I have the data in a pandas DataFrame.

I can loop through the rows, subset the dates of interest for each record, then compute the distances and restrict the selection further. With this many records, though, that is still very slow.

I can see that I could build a 3-D k-d tree on x, y, and the date as epoch time. However, I am not sure how to bound the distance properly when combining dates and geographic distance.
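One way to combine the two constraints in a k-d tree, sketched here on made-up sample data (the coordinate ranges and column-free arrays are assumptions, not the real data): scale epoch time so that the 8-hour window counts the same as the 100 m radius, build a `scipy.spatial.cKDTree` on (x, y, scaled t), query a slightly enlarged radius, and filter exactly afterwards, because a ball query gives a sphere while the problem asks for a cylinder (both limits satisfied independently).

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical sample data: planar coordinates in metres, epoch-style seconds.
rng = np.random.default_rng(111)
n = 500
xy = rng.uniform(0, 1000, size=(n, 2))   # metres
t = rng.uniform(0, 86400, size=n)        # seconds over one day

# Scale time so the 8 h window (28800 s) maps onto the 100 m radius.
scale = 100.0 / 28800.0
pts = np.column_stack([xy, t * scale])

tree = cKDTree(pts)
# Any point within 100 m AND 8 h lies within sqrt(100^2 + 100^2) in 3-D,
# so query that radius, then filter down to the exact cylinder.
neighbours = tree.query_ball_point(pts, r=100.0 * np.sqrt(2))

counts = []
for i, idx in enumerate(neighbours):
    idx = np.array([j for j in idx if j != i], dtype=int)
    if idx.size == 0:
        counts.append(0)
        continue
    close = np.linalg.norm(xy[idx] - xy[i], axis=1) < 100    # spatial limit
    recent = np.abs(t[idx] - t[i]) <= 28800                  # temporal limit
    counts.append(int((close & recent).sum()))
```

The enlarged-radius query never misses a qualifying neighbour, so the final filter only discards false positives; the tree prunes most of the 600,000 rows before any exact distance is computed.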
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

np.random.seed(111)

# `before` and `after` are used below but were not defined in the snippet;
# assumed here to be the 8 h window from the problem statement
before = timedelta(hours=8)
after = timedelta(hours=8)
def work(df):
    output = []
    # loop over the row positions
    for i in range(len(df)):
        l = []
        # first filter by date so there are fewer distances to compute:
        # mask all dates within the window around date i
        date_mask = (df['date'] >= df['date'].iloc[i] - before) & (df['date'] <= df['date'].iloc[i] + after)
        # mask out user i (no self-matches)
        user_mask = df['user'] != df['user'].iloc[i]
        # apply both masks
        dists_to_check = df[date_mask & user_mask]
        # coordinate of point i to measure distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        # coordinates of the date-masked candidates
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        # for each candidate j (starting at 0 -- the original started at 1 and skipped the first row)
        for j in range(len(dists_to_check)):
            # Euclidean distance between point a and candidate j
            x = np.linalg.norm(a - np.array((b[0][j], b[1][j])))
            # if the distance is within range, keep the positional index
            if x <= 100:
                l.append(j)
        try:
            # use the collected positional indices 'l' to take the final subset
            data = dists_to_check.iloc[l]
            # summarise the column of interest, then append to the output list
            output.append(data['status'].sum())
        except IndexError:
            output.append(0)
            # print("There were no data to add")
    return pd.DataFrame(output)
start = datetime.now()
out = work(data)  # `data` is the 600,000-row DataFrame described above
print(datetime.now() - start)
Is there a way to do this query in a vectorised fashion? Or should I pursue a different technique?
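The inner distance loop can be vectorised directly, and sorting by date lets `np.searchsorted` replace the full boolean date mask with an O(log n) window lookup. A sketch under the same column-name assumptions (`date`, `long`, `lat`, `user`, `status`) as the code above; note the output comes back in date-sorted order:

```python
import numpy as np
import pandas as pd

def work_vectorized(df, hours=8, radius=100.0):
    """Same result as work(), but with the distance loop vectorised and
    the date mask replaced by a searchsorted window on sorted dates."""
    df = df.sort_values('date').reset_index(drop=True)
    dates = df['date'].values                      # datetime64[ns], sorted
    xy = df[['long', 'lat']].to_numpy(dtype=float)
    users = df['user'].to_numpy()
    status = df['status'].to_numpy()
    window = np.timedelta64(hours, 'h')

    # index bounds of each row's +/- window in the sorted date array
    lo = np.searchsorted(dates, dates - window, side='left')
    hi = np.searchsorted(dates, dates + window, side='right')

    out = np.empty(len(df))
    for i in range(len(df)):
        s = slice(lo[i], hi[i])
        # all candidate distances in one vectorised call
        d = np.linalg.norm(xy[s] - xy[i], axis=1)
        mask = (d <= radius) & (users[s] != users[i])
        out[i] = status[s][mask].sum()
    return pd.DataFrame(out)
```

The outer loop remains, but each iteration touches only the rows inside the time window and does no Python-level distance arithmetic, which is where most of the original runtime went.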
<3
This solved my problem, at least in part. Since the loop can operate on different parts of the data independently, parallelising it makes sense here.

Using IPython's parallel machinery, the final time was 1:17:54.910206, roughly 1/4 of the original.

I would still be very interested in any small speed improvements anyone can suggest inside the function body.
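The IPython setup is not shown above; an equivalent split using only the standard library's `multiprocessing` could look like the sketch below. The chunking scheme and helper names are assumptions (not the setup actually used), and the column names follow the question's code:

```python
import numpy as np
import pandas as pd
from functools import partial
from multiprocessing import Pool

def work_chunk(idx, df, hours=8, radius=100.0):
    """Run the per-row logic of work() for one slice of row positions."""
    out = []
    for i in idx:
        near_t = (df['date'] - df['date'].iloc[i]).abs() <= pd.Timedelta(hours=hours)
        other = df['user'] != df['user'].iloc[i]
        sub = df[near_t & other]
        d = np.hypot(sub['long'] - df['long'].iloc[i],
                     sub['lat'] - df['lat'].iloc[i])
        out.append(sub.loc[d <= radius, 'status'].sum())
    return out

def work_parallel(df, processes=4):
    # split the row positions into one chunk per worker; each worker gets
    # the whole frame (pickled once per chunk) plus its slice of indices
    chunks = np.array_split(np.arange(len(df)), processes)
    with Pool(processes) as pool:
        parts = pool.map(partial(work_chunk, df=df), chunks)
    return pd.DataFrame([v for part in parts for v in part])
```

Because every row's result depends only on read-only data, the chunks need no coordination, which is why the roughly 4x speed-up from four workers is attainable.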