Memory leak when reading a value from a Pandas DataFrame

import gc
import os

import pandas as pd
import psutil

N_ITER = 100000
DF_SIZE = 10000

# Define the DataFrame
df = pd.DataFrame(index=range(DF_SIZE), columns=['my_col'])
df['my_col'] = range(DF_SIZE)


def memory_usage():
    """Return the memory usage of the current python process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


if __name__ == '__main__':
    for i in range(N_ITER):
        df_ind = pd.DataFrame(df.copy())
        val = df_ind.at[4242, 'my_col']  # The line that provokes the leak!
        del df_ind, val  # Useless
        # gc.collect()  # Garbage Collector prevents the leak but is slow
        if (i % 1000) == 0:
            print('Iter {}\t {} MB'.format(i, int(memory_usage())))

1 Answer

User

#1 · Posted 2024-04-17 19:50:23

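Well, it looks like the real pain comes from the way df_ind is created.

Using a reference to the original DataFrame df seems to work, but it may be a bit risky if we intend to modify {}.

Using a copy of the original DataFrame df triggers the memory leak. There may be implicit copies of some unneeded elements of df. These copied elements are not released by del, but they are reclaimed by gc.collect(). That comes at a cost in run time, because the collection itself is slow.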
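To illustrate why del can leave memory behind while gc.collect() frees it, here is a minimal sketch using only the standard library. It assumes CPython and assumes the leftover objects are held in a reference cycle, which the answer above only suspects and does not confirm for pandas. The Node class and the cycle_survives_del helper are hypothetical names made up for this illustration.

```python
import gc
import weakref


class Node:
    """Toy object (hypothetical) that can point at itself, forming a cycle."""

    def __init__(self):
        self.me = None


def cycle_survives_del():
    """Return (alive_after_del, alive_after_collect) for a self-referencing object."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from running during the demo
    try:
        node = Node()
        node.me = node             # the object now references itself
        probe = weakref.ref(node)  # lets us check liveness without keeping it alive
        del node                   # removes the name, but the cycle keeps the refcount > 0
        alive_after_del = probe() is not None
        gc.collect()               # the cycle detector finds and frees the object
        alive_after_collect = probe() is not None
    finally:
        if gc_was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


alive_after_del, alive_after_collect = cycle_survives_del()
print('alive after del:', alive_after_del)
print('alive after gc.collect():', alive_after_collect)
```

If the pandas objects in the loop behave like this toy object, they would only be released when the cycle collector runs, which would match the observation that gc.collect() prevents the growth.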

Below are the different attempts to fix this memory leak, with their results:

import copy

df_ind = df                    # Works (no leak)! Dangerous since df could be modified

df_ind = copy.copy(df)         # Works (no leak)! Shallow copy, shares data with df
df_ind = copy.deepcopy(df)     # Fails (still leaks).

df_ind = df.copy(deep=False)   # Works (no leak)! Shallow copy, shares data with df
df_ind = df.copy(deep=True)    # Fails (still leaks).
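As a quick check of what separates the working attempts from the failing ones, the sketch below tests whether a copy still points at the same underlying buffer as the original. It assumes NumPy is installed alongside pandas, and shares_buffer is a hypothetical helper written for this illustration, not a pandas API. It only shows the sharing behavior; it does not measure the leak itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'my_col': range(10_000)})

shallow = df.copy(deep=False)  # new DataFrame object, same underlying data
deep = df.copy(deep=True)      # new DataFrame object, data duplicated


def shares_buffer(a, b, col='my_col'):
    """Return True if column `col` of both frames is backed by overlapping memory."""
    return np.shares_memory(a[col].to_numpy(), b[col].to_numpy())


shallow_shares = shares_buffer(df, shallow)
deep_shares = shares_buffer(df, deep)
print('shallow copy shares data with df:', shallow_shares)
print('deep copy shares data with df:', deep_shares)
```

The shallow copy allocates no new column data, which is consistent with it not growing memory in the loop, while the deep copy allocates a fresh buffer on every iteration.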

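In summary:

If you want to modify the temp DataFrame, then don't use pandas. You can use a dictionary or zipped lists to get what you want.
If you don't want to modify the temp DataFrame, then use pandas with the explicit option df_ind = df.copy(deep=False).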
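For the first case, a dictionary-based version of the temporary structure could look roughly like the sketch below. It assumes a single column of immutable values keyed by the index, as in the question's example. The names base and read_and_modify are hypothetical and only serve the illustration.

```python
DF_SIZE = 10_000

# Stand-in for the 'my_col' column: index -> value, built from zipped sequences.
base = dict(zip(range(DF_SIZE), range(DF_SIZE)))


def read_and_modify(source, key, new_value):
    """Copy the mapping, read one value, then change it in the copy only."""
    temp = dict(source)    # shallow copy; enough here because the values are ints
    old_value = temp[key]  # plays the role of df_ind.at[key, 'my_col']
    temp[key] = new_value  # modifies the temporary copy, not the source
    return old_value, temp


old_value, temp = read_and_modify(base, 4242, -1)
print('value read:', old_value)
print('value in temp after edit:', temp[4242])
print('value in base after edit:', base[4242])
```

The copy is a plain dict, so the temporary structure can be edited freely without touching the original data and without creating any pandas objects inside the loop.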
