Reindexed DataFrame objects are unnecessarily kept in memory

Posted 2024-03-28 19:36:47


Following on from this question, I implemented two functions, one that uses reindexing and one that doesn't. They differ in their third line:

def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()

        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist

def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()

        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
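Note that `.ix` was deprecated in pandas 0.20 and removed in 1.0; on a modern pandas the same chained selection can be written with `.loc`. A minimal sketch on a small, made-up MultiIndex frame (the data and cluster members here are hypothetical, just to show the selection):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the post's large MultiIndex frame (hypothetical data).
idx = pd.MultiIndex.from_product([[0, 1, 2], [0, 1, 2]], names=["i", "j"])
df = pd.DataFrame({"score": np.arange(9, dtype=float)}, index=idx)

clust_members = [0, 2]
member = 0

# Equivalent of df.ix[member].ix[clust_members].score.mean() in modern pandas:
# select level-0 label `member`, then the level-1 labels in `clust_members`.
member_mean_dist = 100 - df.loc[member].loc[clust_members]["score"].mean()
print(member_mean_dist)  # scores 0.0 and 2.0 -> mean 1.0 -> 99.0
```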

The functions are called from an IPython notebook cell.

The DataFrame `df` is large, about 4 million rows, taking up roughly 300 MB of memory.

The `update1` function, which uses reindexing, is much faster. But something unexpected happens: after running one iteration with reindexing, memory quickly grows from ~300 MB to 1.5 GB, and then I get an out-of-memory error.

The `update` function does not suffer from this behavior. There are a few things I don't get:

  1. Obviously, reindexing creates a copy. But shouldn't that copy die every time the `update1` function finishes? The `new_df` variable should die with the function that created it... right?

  2. Even if the garbage collector doesn't kill `new_df` immediately, once memory runs low it should kill it rather than raise an out-of-memory error, right?

  3. I tried manually killing the frame by adding `del new_df` at the end of the `update1` function, but it didn't help. So does that mean the bug is actually in the reindexing itself?
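The intuition behind question 1 can be checked directly: in CPython a frame created inside a function is normally collected as soon as the function returns, which hints that the retained memory lives somewhere other than `new_df` itself. A small sketch using `weakref` (the frame and labels here are made up):

```python
import gc
import weakref

import pandas as pd

df = pd.DataFrame({"score": [1.0, 2.0, 3.0]})

def make_reindexed_ref():
    # Build a reindexed copy locally and hand back only a weak reference,
    # so nothing outside the function keeps the copy alive.
    new_df = df.reindex([0, 2])
    return weakref.ref(new_df)

ref = make_reindexed_ref()
gc.collect()
# The local copy was collected when the function returned.
print(ref() is None)  # True
```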

Edit:

I found the problem, but I don't understand why it behaves this way: the Python garbage collector refuses to clean up the reindexed DataFrame. This works:

for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)

This also works:

def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score  = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()

This keeps the reindexed objects in memory:

z = []
for i in range(2000):
    z.append(reindex())

I think my usage is naively correct. How does the `new_df` variable stay attached to the score value, and why?


Tags: function, in, df, for, dist, mean, level, member
1 Answer

Here is my debugging code. When you do the indexing, the Index object creates `_tuples` and `_engine` caches; I think the memory is being used by these two cached objects. If I add the lines marked with `****`, the memory increase is very small, about 6 MB on my machine:

import pandas as pd
print(pd.__version__)
import numpy as np
import psutil
import os
import gc

def get_memory():
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.memory_info().rss  # get_memory_info() was renamed in newer psutil

def get_object_ids():
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
df.index._tuples = None    # **** drop the cached tuple list (pandas internal)
df.index._cleanup()        # **** drop the cached index engine (pandas internal)
del df2
gc.collect()               # **** force a collection before measuring
m3 = get_memory()

print((m2 - m1) / 1e6, (m3 - m2) / 1e6)

from collections import Counter

counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print(counter)
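Note that `_tuples` and `_cleanup()` are pandas internals and may change or disappear across versions. A version-agnostic way to watch how much memory a reindexing loop allocates and retains is the standard library's `tracemalloc` (a sketch with made-up sizes, not the answer's original method):

```python
import tracemalloc

import numpy as np
import pandas as pd

# Small synthetic MultiIndex frame; sizes are illustrative only.
idx = pd.MultiIndex.from_product([range(50), range(50)])
df = pd.DataFrame(np.random.rand(2500, 3), index=idx, columns=["a", "b", "c"])

tracemalloc.start()
for _ in range(5):
    df2 = df.reindex(list(range(0, 50, 2)), level=0)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `peak` is the high-water mark during the loop; `current` is what is still
# held afterwards -- a large gap between them means the copies were freed,
# while a large `current` points at something being cached.
print(current, peak)
```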
