Excessive memory usage when fetching data from a Postgres database

Published 2024-06-01 02:40:26


I have been using Python to fetch data from a Postgres database, and it is consuming a large amount of memory, as shown below:

[Image: memory usage]

The function below is the only one I am running, and it consumes excessive memory. I am using fetchmany() to retrieve the data in small chunks, and I have also tried iterating over the cursor directly. However, all of these approaches end up using a lot of memory. Does anyone know why this happens? Is there anything I need to tune on the Postgres side that could help mitigate the problem?

import logging

import pandas as pd
import psycopg2
from tqdm import tqdm

# findTables() is defined elsewhere in this project

def checkMultipleLine(dbName):
    '''
    Checks for rows that contain data spanning multiple lines

    This is the most basic of checks. If a particular row has
    data that spans multiple lines, then that row is corrupt.
    To deal with such rows we must first find out whether there
    are places in the database that contain data spanning
    multiple lines.
    '''

    logger = logging.getLogger('mindLinc.checkSchema.checkMultipleLines')
    logger.info('Finding rows that span multiple lines')

    schema = findTables(dbName)

    results = []
    for t in tqdm(sorted(schema.keys())):

        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
        cur  = conn.cursor()
        cur.execute('select * from %s'%t)
        n = 0
        N = 0
        while True:
            css = cur.fetchmany(1000)
            if not css: break
            for cs in css:
                N += 1
                if any('\n' in c for c in cs if isinstance(c, str)):
                    n += 1
        cur.close()
        conn.close()

        tqdm.write('[%40s] -> [%5d][%10d][%.4e]'%(t, n, N, n/(N+1.0)))
        results.append({
            'tableName': t,
            'totalRows': N,
            'badRows'  : n,
        })


    logger.info('Finished checking for multiple lines')

    results = pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
    print(results)
    results.to_csv('error_MultipleLine[%s].csv'%(dbName), index=False)

    return results
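A likely cause is that psycopg2's default (client-side) cursor downloads the *entire* result set into client memory when execute() runs; fetchmany() then only slices rows that are already in memory, so chunking does not help. Passing a name to cursor() creates a server-side cursor, which keeps the result set on the Postgres server and transfers only the requested rows. A minimal sketch of the per-table check using a named cursor (the helper names count_multiline_rows and check_table are hypothetical, not from the original code):

```python
def count_multiline_rows(rows):
    """Count rows in which any string field contains an embedded newline."""
    return sum(
        1 for row in rows
        if any('\n' in c for c in row if isinstance(c, str))
    )

def check_table(dbName, table, chunk_size=1000):
    import psycopg2
    from psycopg2 import sql

    conn = psycopg2.connect(dbname=dbName, user='postgres', host='localhost')
    try:
        # Passing name= creates a server-side (named) cursor: the result
        # set stays on the server, and fetchmany() pulls only chunk_size
        # rows per network round trip instead of the whole table.
        with conn.cursor(name='multiline_check') as cur:
            # sql.Identifier quotes the table name safely instead of
            # interpolating it into the query string with %s.
            cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(table)))
            bad = total = 0
            while True:
                chunk = cur.fetchmany(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
                bad += count_multiline_rows(chunk)
        conn.commit()
    finally:
        conn.close()
    return bad, total
```

The fetchmany() loop from the original function works unchanged once the cursor is named; note that a named cursor must run inside a transaction (psycopg2 opens one implicitly) and each open cursor on a connection needs a distinct name.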
