I have been using Python to fetch data from a Postgres database, and it is using a lot of memory. The function below is the only one I am running, and it consumes an excessive amount of memory. I am using fetchmany() and pulling the data in small chunks. I also tried iterating over the cursor directly. However, all of these approaches end up using a lot of memory. Does anyone know why this happens? Is there anything I need to tune on the Postgres side that could help mitigate this problem?
```python
import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines.
    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that particular row
    is corrupt. To deal with these rows we must first find
    out whether there are places in the database that contain
    data spanning multiple lines.
    '''
    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')
    schema = findTables(dbName)
    results = []
    for t in tqdm(sorted(schema.keys())):
        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
        cur = conn.cursor()
        cur.execute('select * from %s' % t)
        n = 0
        N = 0
        while True:
            css = cur.fetchmany(1000)
            if css == []:
                break
            for cs in css:
                N += 1
                if any(['\n' in c for c in cs if type(c) == str]):
                    n += 1
        cur.close()
        conn.close()
        tqdm.write('[%40s] -> [%5d][%10d][%.4e]' % (t, n, N, n / (N + 1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows':   n,
        })
    logger.info('Finished checking for multiple lines')
    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv' % dbName, index=False)
    return results
```
Psycopg2 supports server-side cursors for large queries, as described in this answer. Here is how to use one together with a client-side buffer setting; this reduces the memory footprint.
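A minimal sketch of the server-side cursor approach: passing a `name` to `conn.cursor()` makes psycopg2 declare a named cursor on the Postgres server, so the full result set stays on the server and rows stream to the client in batches of `itersize`. The connection string, table name `mytable`, and the `process` callback are placeholders; adapt them to your own schema loop.

```python
import psycopg2

def stream_table(dbName, table, process, batch=1000):
    """Iterate over all rows of `table` without loading the whole
    result set into client memory."""
    conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
    # Passing a name creates a server-side (named) cursor: rows are kept
    # in Postgres and fetched lazily instead of all at once on execute().
    cur = conn.cursor(name='large_query')
    cur.itersize = batch  # rows transferred per network round trip
    cur.execute('select * from %s' % table)
    for row in cur:       # iterating fetches `itersize` rows at a time
        process(row)
    cur.close()
    conn.close()
```

With a plain (unnamed) client-side cursor, `execute()` transfers the entire result set to the client up front, which is why `fetchmany()` alone does not reduce memory usage; the named cursor is what changes that behavior.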