一次从临时表中获取n条记录
我有一个临时表,里面大约有100万条记录。这个临时表是从一个更大的查询结果中得到的。我想一次处理1000条记录,比如说。有什么好的方法可以设置查询,让我先得到前1000行,然后再得到下一组1000行,依此类推?这些记录本身没有特定的顺序,不过临时表里有一列是ID,所以如果需要的话,我可以对它进行排序。我在考虑给临时表增加一个额外的列,用来给所有行编号,类似于:
CREATE TEMP TABLE tmptmp AS
SELECT ##autonumber somehow##, id
FROM .... --complicated query
然后我可以这样做:
SELECT * FROM tmptmp WHERE autonumber>=0 AND autonumber < 1000
等等……我该怎么实际操作呢?或者有没有更好的方法?我在使用Python和PostgreSQL。
2 个回答
1
也许你可以用这样的方式(我们在批量更新一个有超过2000万行的表时会用到这个方法,这样可以避免占用太多的复制资源)。
import sys
import psycopg2
from datetime import datetime
firstid = 0
splitsize = 50 # Size of each batch
# Complicated query
query_complex = """
CREATE TEMP TABLE tmptmp AS
SELECT * FROM schema.massive_table
"""
# Query to be run at intervals
query = """
SELECT * FROM tmptmp WHERE id BETWEEN %(startid)s AND %(endid)s
"""
conn = psycopg2.connect("dbname=database_name user=postgres")
curs = conn.cursor()
# Run complicated query
curs.execute(query_complex)
# Get highest id
curs.execute("SELECT max(id) FROM tmptmp")
maxid = curs.fetchall()[0][0]
print "Max id: %s" % maxid
for startid in range(firstid, maxid, splitsize):
endid = startid + splitsize - 1
print "%s: Running query on range %s to %s" % (datetime.now(), startid, endid)
curs.execute(query, {'startid':startid, 'endid':endid})
print "%s: Affected rows: %s. Total completed: %s%%" % (datetime.now(), curs.rowcount, round((endid * 100) / maxid, 3))
print "Done."
接下来是输出结果:
Max id: 308
2010-06-18 11:59:11.271000: Running query on range 0 to 49
2010-06-18 11:59:11.271000: Affected rows: 49. Total completed: 15.0%
2010-06-18 11:59:11.271000: Running query on range 50 to 99
2010-06-18 11:59:11.271000: Affected rows: 50. Total completed: 32.0%
2010-06-18 11:59:11.271000: Running query on range 100 to 149
2010-06-18 11:59:11.271000: Affected rows: 50. Total completed: 48.0%
2010-06-18 11:59:11.271000: Running query on range 150 to 199
2010-06-18 11:59:11.271000: Affected rows: 49. Total completed: 64.0%
2010-06-18 11:59:11.271000: Running query on range 200 to 249
2010-06-18 11:59:11.271000: Affected rows: 42. Total completed: 80.0%
2010-06-18 11:59:11.271000: Running query on range 250 to 299
2010-06-18 11:59:11.318000: Affected rows: 3. Total completed: 97.0%
2010-06-18 11:59:11.318000: Running query on range 300 to 349
2010-06-18 11:59:11.318000: Affected rows: 1. Total completed: 113.0%
Done.
// 约翰
4
使用游标来获取你需要的行。当你的记录很多时,使用偏移量和限制会变得很慢,而游标会表现得更好。
http://www.postgresql.org/docs/8.4/interactive/sql-fetch.html