Embarrassingly parallel database updates with Python (PostGIS/PostgreSQL)

3 votes
2 answers
4262 views
Asked 2025-04-17 02:41

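I need to update every record in a spatial database in which a set of points overlays a set of polygons. For each point, I want to assign a key that links it to the polygon it lies within. For example, if my point 'New York City' lies within the polygon 'USA', and that polygon has 'GID = 1', then I would assign 'gid_fkey = 1' to my point 'New York City'.

To do this, I created the following query.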

procQuery = 'UPDATE city SET gid_fkey = gid FROM country  WHERE ST_within((SELECT the_geom FROM city WHERE city_id = %s), country.the_geom) AND city_id = %s' % (cityID, cityID)
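Not part of the original post: a minimal sketch of the same statement with the id passed as a bound parameter instead of being spliced in with Python's `%` operator. It assumes `city_id` is the key column in both places, and `build_update` is a hypothetical helper name. It is written for Python 3, whereas the code in this thread is Python 2.

```python
# Hypothetical helper: keeps the SQL text and its parameters separate so that
# psycopg2 binds the value itself in cursor.execute(sql, params).
UPDATE_CITY_SQL = (
    "UPDATE city SET gid_fkey = gid FROM country "
    "WHERE ST_Within((SELECT the_geom FROM city WHERE city_id = %(city_id)s), "
    "country.the_geom) "
    "AND city_id = %(city_id)s"
)


def build_update(city_id):
    """Return (sql, params) for one city; nothing is executed here."""
    return UPDATE_CITY_SQL, {"city_id": int(city_id)}


# Usage with an already open psycopg2 cursor (not run here):
#     cursor.execute(*build_update(42))
```

Since the SQL text never changes, the same string can be reused for every id and only the parameter dictionary differs per call.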

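Currently I get the cityID values from another query, which selects every cityID whose gid_fkey is NULL. Essentially I just need to loop over those cityIDs and run the query above for each one. Because the query only depends on static information in the other table, in theory all of these updates could run at the same time. I have implemented the threaded procedure below, but I can't seem to migrate it to multiprocessing.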

import psycopg2, threading, Queue

queue = Queue.Queue()
pyConn = psycopg2.connect("dbname='geobase_1' host='localhost'")
pyConn.set_isolation_level(0)  # 0 = autocommit
pyCursor1 = pyConn.cursor()

# Only the cities that still need a country key
getGID = 'SELECT city_id FROM city WHERE gid_fkey IS NULL'
pyCursor1.execute(getGID)
gidList = pyCursor1.fetchall()

class threadClass(threading.Thread):

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):

        while True:
            gid = self.queue.get()
            cityID = gid[0]  # fetchall() returns one-element tuples

            procQuery = 'UPDATE city SET gid_fkey = gid FROM country  WHERE ST_within((SELECT the_geom FROM city WHERE city_id = %s), country.the_geom) AND city_id = %s' % (cityID, cityID)

            # One cursor per thread; the connection itself is shared
            pyCursor2 = pyConn.cursor()
            pyCursor2.execute(procQuery)

            print cityID
            print 'Done'
            # queue.join() only returns once every item is marked done
            self.queue.task_done()

def main():

    # Start four daemon worker threads
    for i in range(4):
        t = threadClass(queue)
        t.setDaemon(True)
        t.start()

    # Enqueue each city exactly once
    for gid in gidList:
        queue.put(gid)

    queue.join()

main()

I'm not even sure that multithreading is optimal, but it is definitely faster than going through the records one by one.

The machine I will be using has four cores (quad-core) and runs a minimal Linux OS with no GUI, plus PostgreSQL, PostGIS and Python, in case that makes a difference.

What do I need to change to get this simple multiprocessing task working?

2 Answers

1

In plain SQL, you could do this:

UPDATE city ci
SET gid_fkey = co.gid 
FROM country co 
WHERE ST_within(ci.the_geom , co.the_geom) 
AND ci.city_id = _some_parameter_;

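If a city can fall within more than one country (which would lead to several updates of the same row), this could be a problem, but that probably doesn't happen in your data.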
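Not part of the original answer: a hedged sketch of how the statement above could be run for a batch of ids at a time instead of one id per call, which cuts the number of round trips. It relies on psycopg2 adapting a Python list to a PostgreSQL array for `= ANY(%s)`. The names `chunked`, `update_batch` and `BATCH_UPDATE_SQL` are hypothetical, and the code is Python 3.

```python
def chunked(ids, size):
    """Split a list of ids into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Same join as the statement above, but matching a whole array of ids.
BATCH_UPDATE_SQL = """
UPDATE city ci
SET gid_fkey = co.gid
FROM country co
WHERE ST_Within(ci.the_geom, co.the_geom)
AND ci.city_id = ANY(%s)
"""


def update_batch(cursor, city_ids):
    """Run the set-based UPDATE for one batch on an open DB-API cursor."""
    cursor.execute(BATCH_UPDATE_SQL, (list(city_ids),))
    return cursor.rowcount  # number of rows the statement touched
```

Each batch returned by `chunked(all_ids, 1000)` could then be handed to a separate worker. The batch size of 1000 is only an illustrative value, not a measured optimum.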

5

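OK, this is an answer to my own question. Well done me =D

On my system, going from single-core threading to quad-core multiprocessing gave a speedup of roughly 150%.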

import multiprocessing, psycopg2

class Consumer(multiprocessing.Process):

    def __init__(self, task_queue, result_queue):
        multiprocessing.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            next_task = self.task_queue.get()
            if next_task is None:
                # Poison pill: no more work for this process
                print 'Tasks Complete'
                self.task_queue.task_done()
                break
            answer = next_task()
            self.task_queue.task_done()
            self.result_queue.put(answer)
        return


class Task(object):

    def __init__(self, a):
        self.a = a  # the city_id to update

    def __call__(self):
        # Every task opens (and closes) its own connection
        pyConn = psycopg2.connect("dbname='geobase_1' host = 'localhost'")
        pyConn.set_isolation_level(0)  # 0 = autocommit
        pyCursor1 = pyConn.cursor()

        procQuery = 'UPDATE city SET gid_fkey = gid FROM country  WHERE ST_within((SELECT the_geom FROM city WHERE city_id = %s), country.the_geom) AND city_id = %s' % (self.a, self.a)

        pyCursor1.execute(procQuery)
        pyConn.close()
        print 'What is self?'
        print self.a

        return self.a

    def __str__(self):
        return 'ARC'


if __name__ == '__main__':
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    # Two worker processes per core
    num_consumers = multiprocessing.cpu_count() * 2
    consumers = [Consumer(tasks, results) for i in xrange(num_consumers)]
    for w in consumers:
        w.start()

    pyConnX = psycopg2.connect("dbname='geobase_1' host = 'localhost'")
    pyConnX.set_isolation_level(0)
    pyCursorX = pyConnX.cursor()

    # Cities that still need a country key; fetchall() returns 1-tuples
    pyCursorX.execute('SELECT city_id FROM city WHERE gid_fkey IS NULL')
    cityIdList = [row[0] for row in pyCursorX.fetchall()]
    num_jobs = len(cityIdList)

    for cityId in cityIdList:
        tasks.put(Task(cityId))

    # One poison pill per consumer so that every process exits
    for i in xrange(num_consumers):
        tasks.put(None)

    # Collect one result per submitted task
    while num_jobs:
        result = results.get()
        print result
        num_jobs -= 1

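I now have a follow-up question, which I have posted here:

Create DB connection and maintain on multiple processes (multiprocessing)

Hopefully we can cut some of the overhead and make this run even faster.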
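Not from the original thread: one possible way to cut that overhead, sketched under assumptions. A `multiprocessing.Pool` initializer opens a single connection per worker process, and every task handled by that worker reuses it. All names here (`init_worker`, `update_city`, `run_updates`, `connect_geobase`) are hypothetical, the SQL is the join form from the first answer, and the code is Python 3.

```python
import multiprocessing

UPDATE_SQL = (
    "UPDATE city ci SET gid_fkey = co.gid FROM country co "
    "WHERE ST_Within(ci.the_geom, co.the_geom) AND ci.city_id = %s"
)

_conn = None  # set once per worker by init_worker


def connect_geobase():
    """Assumed connection settings, copied from the post."""
    import psycopg2  # imported here so the module loads without psycopg2
    conn = psycopg2.connect("dbname='geobase_1' host='localhost'")
    conn.autocommit = True
    return conn


def init_worker(connect):
    """Runs once in each worker: open a connection and keep it."""
    global _conn
    _conn = connect()


def update_city(city_id):
    """Update one city over this worker's long-lived connection."""
    cursor = _conn.cursor()
    cursor.execute(UPDATE_SQL, (city_id,))
    return city_id


def run_updates(city_ids, connect, processes=4, pool_cls=multiprocessing.Pool):
    """Spread the ids over the workers and return the processed ids, sorted."""
    with pool_cls(processes, initializer=init_worker, initargs=(connect,)) as pool:
        return sorted(pool.imap_unordered(update_city, city_ids))
```

A call such as `run_updates(city_id_list, connect_geobase)` would belong under an `if __name__ == '__main__':` guard, as in the code above. Whether this actually beats one connection per task would need to be measured on the real data.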
