I have been using Python to fetch data from a Postgres database, and it is using a lot of memory. The function below is the only one I am running, and it consumes an excessive amount of memory. I am using fetchmany() and pulling the data in small chunks. I also tried iterating over the cursor directly. However, all of these approaches end up using a lot of memory. Does anyone know why this happens? Is there anything I need to tune on the Postgres side that could help mitigate this problem?
```python
import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.
    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. To deal with these rows we must first find
    out whether there are places in the database that contain
    data spanning multiple lines.
    '''
    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')
    schema = findTables(dbName)
    results = []
    for t in tqdm(sorted(schema.keys())):
        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
        cur = conn.cursor()
        cur.execute('select * from %s' % t)
        n = 0
        N = 0
        while True:
            css = cur.fetchmany(1000)
            if css == []:
                break
            for cs in css:
                N += 1
                if any(['\n' in c for c in cs if type(c) == str]):
                    n += 1
        cur.close()
        conn.close()
        tqdm.write('[%40s] -> [%5d][%10d][%.4e]' % (t, n, N, n / (N + 1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows':   n,
        })
    logger.info('Finished checking for multiple lines')
    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv' % dbName, index=False)
    return results
```
Psycopg2 supports server-side cursors for large queries, as described in this answer. Here is how to use one together with a client-side buffer setting; this reduces the memory footprint.
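A minimal sketch of the server-side cursor approach: passing a `name` to `conn.cursor()` makes psycopg2 declare a named cursor on the Postgres server, so the full result set stays on the server and rows stream to the client in batches of `itersize`. The connection string, table name `mytable`, and the `process` callback are placeholders; adapt them to your own schema loop.

```python
import psycopg2

def stream_table(dbName, table, process, batch=1000):
    """Iterate over all rows of `table` without loading the whole
    result set into client memory."""
    conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
    # Passing a name creates a server-side (named) cursor: rows are kept
    # in Postgres and fetched lazily instead of all at once on execute().
    cur = conn.cursor(name='large_query')
    cur.itersize = batch  # rows transferred per network round trip
    cur.execute('select * from %s' % table)
    for row in cur:       # iterating fetches `itersize` rows at a time
        process(row)
    cur.close()
    conn.close()
```

With a plain (unnamed) client-side cursor, `execute()` transfers the entire result set to the client up front, which is why `fetchmany()` alone does not reduce memory usage; the named cursor is what changes that behavior.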