Python: spatio-temporal query across many records

Posted 2024-04-25 14:16:11

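I have a DataFrame of 600,000 x/y points with date and time information, plus another field, 'status', with additional descriptive information.

My goal is, for each record:

  • sum the 'status' column over the records that fall within a certain spatio-temporal buffer

The specific buffer is within t - 8 hours and < 100 meters.

Currently I have the data in a pandas DataFrame.

I could loop through the rows and, for each record, subset the dates of interest, then calculate the distances and restrict the selection further. However, that is still quite slow with so many records.

  • This takes 4.4 hours to run.

I can see that I could create a 3-dimensional kdtree with x, y, and the date as epoch time. However, I am not sure how to restrict the distances properly when combining dates and geographic distances.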
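One possible way around mixing meters and seconds in a single tree, given here as a sketch under assumptions rather than a tested solution for the full 600,000 rows: build a 2-D tree on x/y only with SciPy's cKDTree, ask it for every neighbour within 100 m, and apply the time window and the user filter to those neighbours afterwards. The function name buffered_status_sum is hypothetical, the column names ('long', 'lat', 'date', 'user', 'status') follow the code in this question, and the default window (8 hours before, 1 minute after) is taken from the values used in this thread.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def buffered_status_sum(df, radius=100.0,
                        before=np.timedelta64(8, "h"),
                        after=np.timedelta64(1, "m")):
    """Sum 'status' over other users' records inside the space-time buffer."""
    xy = df[["long", "lat"]].to_numpy(dtype=float)
    dates = df["date"].to_numpy()
    users = df["user"].to_numpy()
    status = df["status"].to_numpy()

    # Spatial part: one list of neighbour positions (self included) per record.
    tree = cKDTree(xy)
    neighbours = tree.query_ball_point(xy, r=radius)

    out = np.zeros(len(df), dtype=status.dtype)
    for i, idx in enumerate(neighbours):
        idx = np.asarray(idx, dtype=int)
        # Temporal part and user filter, applied only to the spatial neighbours.
        keep = ((dates[idx] >= dates[i] - before)
                & (dates[idx] <= dates[i] + after)
                & (users[idx] != users[i]))
        out[i] = status[idx][keep].sum()
    return pd.Series(out, index=df.index)
```

Keeping space and time separate means no scaling factor between meters and seconds has to be chosen. The neighbour lists for all records are held in memory at once, so for very dense data it may be necessary to query the tree in batches.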

Here is some reproducible code for you to test with:

Imports

import numpy as np
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta

Create data

np.random.seed(111)

Function to generate test data

^{pr2}$

Function to speed up

def work(df):

    output = []
    # loop through the row positions of the data
    for i in range(0, len(df)):
        l = []
        # first filter the data by date to get a smaller set to compute distances for

        # mask for all dates inside the window around the date of record i
        date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
        # mask for all users who are not user i (themselves)
        user_mask = df['user']!=df['user'].iloc[i]
        # apply masks
        dists_to_check = df[date_mask & user_mask]

        # for point i, create the coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        # coordinates of the date-masked records to check
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))

        # for each record j in the date-masked data (starting at 0 so the first one is not skipped)
        for j in range(0, len(dists_to_check)):
            # compute the Euclidean distance between point a and point j of b
            x = np.linalg.norm(a-np.array((b[0][j], b[1][j])))

            # if the distance is within our range of interest, append the position to a list
            if x <=100:
                l.append(j)
            else:
                pass
        try:
            # use the list of desired positions 'l' to query a final subset of the data
            data = dists_to_check.iloc[l]
            # sum the column of interest, then append to the output list
            output.append(data['status'].sum())
        except IndexError:
            output.append(0)
            #print("There were no data to add")

    return pd.DataFrame(output)

Run the code and time it

start = datetime.now()
out = work(data)
print(datetime.now() - start)

Is there a way to do this query in a vectorized way? Or should I be pursuing another technique?

<3
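A sketch of one partly vectorized approach, with assumptions stated up front: the records are sorted by date once, np.searchsorted finds the slice of rows inside each record's time window, and the distance test is done on that slice with array arithmetic instead of an inner Python loop. The name status_sum_searchsorted is hypothetical, the column names follow the question's code, and the default window (8 hours before, 1 minute after) is assumed from the values used in this thread. The outer loop over records remains.

```python
import numpy as np
import pandas as pd


def status_sum_searchsorted(df, radius=100.0,
                            before=np.timedelta64(8, "h"),
                            after=np.timedelta64(1, "m")):
    """Sum 'status' of other users' records inside the time window and radius."""
    # Sort everything by date once; 'order' maps sorted position -> original position.
    order = np.argsort(df["date"].to_numpy(), kind="stable")
    dates = df["date"].to_numpy()[order]
    x = df["long"].to_numpy(dtype=float)[order]
    y = df["lat"].to_numpy(dtype=float)[order]
    users = df["user"].to_numpy()[order]
    status = df["status"].to_numpy()[order]

    # Inclusive window bounds for every record, found by binary search.
    lo = np.searchsorted(dates, dates - before, side="left")
    hi = np.searchsorted(dates, dates + after, side="right")

    out = np.zeros(len(df), dtype=status.dtype)
    for i in range(len(df)):
        s = slice(lo[i], hi[i])
        # Squared distances to every record in the window, computed at once.
        d2 = (x[s] - x[i]) ** 2 + (y[s] - y[i]) ** 2
        keep = (d2 <= radius ** 2) & (users[s] != users[i])
        out[order[i]] = status[s][keep].sum()  # write back in original row order
    return pd.Series(out, index=df.index)
```

Comparing squared distances avoids a square root per pair, and working on plain NumPy arrays avoids building a boolean mask over the whole DataFrame for every record.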


1 answer
User
#1 · Posted 2024-04-25 14:16:11

This solves my problem at least partially. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
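For readers without an IPython cluster, here is a minimal sketch of the same idea using the standard library's concurrent.futures. This is a swapped-in technique, not what was actually run for this answer. The names work_chunk and parallel_work, the worker count, and the default window (8 hours before, 1 minute after) are assumptions for illustration; the column names follow the question's code.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def work_chunk(args):
    """Compute the buffered status sums for rows start..stop-1 against the full table."""
    df, start, stop, radius, before, after = args
    dates = df["date"].to_numpy()
    x = df["long"].to_numpy(dtype=float)
    y = df["lat"].to_numpy(dtype=float)
    users = df["user"].to_numpy()
    status = df["status"].to_numpy()
    out = []
    for i in range(start, stop):
        keep = ((dates >= dates[i] - before) & (dates <= dates[i] + after)
                & (users != users[i])
                & ((x - x[i]) ** 2 + (y - y[i]) ** 2 <= radius ** 2))
        out.append(status[keep].sum())
    return out


def parallel_work(df, n_workers=4, radius=100.0,
                  before=np.timedelta64(8, "h"),
                  after=np.timedelta64(1, "m")):
    """Split the row positions into n_workers ranges and process them in parallel."""
    bounds = np.linspace(0, len(df), n_workers + 1).astype(int)
    tasks = [(df, int(lo), int(hi), radius, before, after)
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(work_chunk, tasks))
    # pool.map preserves task order, so the pieces line up with the original rows.
    return pd.Series([v for part in parts for v in part], index=df.index)
```

Each worker receives the full table but only a range of row positions to evaluate, so records near a chunk boundary still see all of their neighbours. On platforms that start workers by spawning a new interpreter, parallel_work(data) should be called from under an `if __name__ == "__main__":` guard.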

Using IPython

from IPython.parallel import Client  # on IPython 4.0 and later: from ipyparallel import Client
cli = Client()
cli.ids

dview=cli[:]

with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd

#We also need to add the time deltas and the output list to the function as
#local variables, and add the IPython.parallel decorator

@dview.parallel(block=True)
def work(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
    # ... the rest of the function body is the same as in the question

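Final time: 1:17:54.910206, roughly 30% of the original 4.4 hours.

I would still be very interested in any suggestions for small speed improvements within the body of the function.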
