Python memory leak - solved, but still puzzled
I managed to fix the memory leak I was running into. However, I noticed some strange behavior along the way.
for fid, fv in freqDic.iteritems():
    outf.write(fid + "\t")  # ID
    for i, term in enumerate(domain):  # Vector
        tfidf = self.tf(term, fv) * self.idf(term, docFreqDic)
        if i == len(domain) - 1:
            outf.write("%f\n" % tfidf)
        else:
            outf.write("%f\t" % tfidf)
    outf.flush()
    print "Memory increased by", int(self.memory_mon.usage()) - startMemory
outf.close()

def tf(self, term, freqVector):
    total = freqVector[TOTAL]
    if total == 0:
        return 0
    if term not in freqVector:  ## Without these two lines, the memory leak occurs
        return 0                ##
    return float(freqVector[term]) / freqVector[TOTAL]

def idf(self, term, docFrequencyPerTerm):
    if term not in docFrequencyPerTerm:
        return 0
    return math.log(float(docFrequencyPerTerm[TOTAL]) / docFrequencyPerTerm[term])

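In short, my situation is:
1) I am computing tf-idf.
2) I traced the source of the memory leak to defaultdict.
3) I am using memory_mon from "How to get current CPU and RAM usage in Python?".
4) The cause of the leak is as follows: a) in self.tf, if the lines "if term not in freqVector: return 0" are missing, memory leaks. (I verified this with memory_mon: memory usage rises sharply and keeps rising.)
The solution to my problem: 1) Because fv is a defaultdict, any lookup of a key that is not in fv creates a new entry. Over a very large domain, this leads to the memory leak.
I decided to use a plain dict instead of defaultdict, and the memory problem went away.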
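To make the mechanism described above concrete, here is a minimal sketch in Python 3 (the code in this question is Python 2). It is not the original program: the names tf_unguarded, tf_guarded, make_vector and the value of TOTAL are hypothetical stand-ins. It only illustrates that indexing a defaultdict with a missing key inserts that key, while a membership test does not.

```python
from collections import defaultdict

TOTAL = "__total__"  # hypothetical sentinel key, standing in for TOTAL above


def tf_unguarded(term, freq_vector):
    # Indexing a defaultdict with a missing key inserts that key.
    total = freq_vector[TOTAL]
    if total == 0:
        return 0.0
    return float(freq_vector[term]) / total


def tf_guarded(term, freq_vector):
    # The membership test does not insert anything.
    total = freq_vector[TOTAL]
    if total == 0 or term not in freq_vector:
        return 0.0
    return float(freq_vector[term]) / total


def make_vector():
    fv = defaultdict(int)
    fv["apple"] = 3
    fv[TOTAL] = 3
    return fv


domain = ["term%d" % i for i in range(1000)]  # terms absent from fv

unguarded = make_vector()
for term in domain:
    tf_unguarded(term, unguarded)

guarded = make_vector()
for term in domain:
    tf_guarded(term, guarded)

print(len(unguarded))  # 1002: every missing term became an entry
print(len(guarded))    # 2: unchanged
```

With a large domain and many documents, the unguarded version adds one entry per missing term to every frequency vector, which would be consistent with the steady growth reported here.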
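The only thing that puzzles me: since fv is created in "for fid, fv in freqDic.iteritems():", shouldn't fv be destroyed at the end of each iteration? I tried adding gc.collect() at the end of the loop, but gc did not collect anything (it returned 0). Yes, the hypothesis is correct, but if each iteration destroyed all of its temporary variables, memory usage should stay relatively stable.
This is the output with those two lines in self.tf: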
Memory increased by 12
Memory increased by 948
Memory increased by 28
Memory increased by 36
Memory increased by 36
Memory increased by 32
Memory increased by 28
Memory increased by 32
Memory increased by 32
Memory increased by 32
Memory increased by 40
Memory increased by 32
Memory increased by 32
Memory increased by 28
And this is the output without those two lines:
Memory increased by 1652
Memory increased by 3576
Memory increased by 4220
Memory increased by 5760
Memory increased by 7296
Memory increased by 8840
Memory increased by 10456
Memory increased by 12824
Memory increased by 13460
Memory increased by 15000
Memory increased by 17448
Memory increased by 18084
Memory increased by 19628
Memory increased by 22080
Memory increased by 22708
Memory increased by 24248
Memory increased by 26704
Memory increased by 27332
Memory increased by 28864
Memory increased by 30404
Memory increased by 32856
Memory increased by 33552
Memory increased by 35024
Memory increased by 36564
Memory increased by 39016
Memory increased by 39924
Memory increased by 42104
Memory increased by 42724
Memory increased by 44268
Memory increased by 46720
Memory increased by 47352
Memory increased by 48952
Memory increased by 50428
Memory increased by 51964
Memory increased by 53508
Memory increased by 55960
Memory increased by 56584
Memory increased by 58404
Memory increased by 59668
Memory increased by 61208
Memory increased by 62744
Memory increased by 64400
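I look forward to your answers.
Edit: It looks like my terminology may have been wrong (or at least looked wrong).
- The memory leak I am referring to is not the one caused by freqVector[term] (looking up a nonexistent key in a defaultdict).
- The memory leak I am actually talking about comes from
    for fid, fv in freqDic.iteritems()
  I know the size of fv grows because of 1), but fv should be destroyed at the end of the loop! Memory should not keep expanding. Isn't that a memory leak?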
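3 Answers
I suspect Python's memory usage may be growing because floats are objects in Python too. The interpreter maintains a free list of float objects, and that list is unbounded and never released. So whenever a floating-point computation produces a float that did not exist before, Python allocates a new float object from the free list and keeps it around in case it is needed later.
There is a similar discussion on the Python bug tracker.
This is not a memory leak, because the memory is not lost; it is being used by your defaultdict. For example: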
from collections import defaultdict
d = defaultdict(int)
for i in xrange(10**7):
    a = d[i]
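Would you call this a memory leak? You are assigning values into the dictionary, so memory usage is expected to grow. It is much like the following: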
d = {}
for i in xrange(10**7):
    d[i] = 0
This is not a memory leak either.
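One way to check this claim is to measure it. The sketch below uses the standard-library tracemalloc module (Python 3, whereas the snippets above are Python 2) to compare a defaultdict that is indexed with missing keys against one that is queried with .get(), which never inserts. The helper names traced_growth, lookup_with_defaultdict and lookup_with_get are hypothetical, and the exact byte counts depend on the interpreter, so only the relative difference is meaningful.

```python
import tracemalloc
from collections import defaultdict


def traced_growth(func, n):
    """Return (bytes still allocated after func(n), result of func)."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = func(n)  # keep the result alive while measuring
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, result


def lookup_with_defaultdict(n):
    d = defaultdict(int)
    for i in range(n):
        _ = d[i]  # a missing key is inserted with the default value 0
    return d


def lookup_with_get(n):
    d = defaultdict(int)
    for i in range(n):
        _ = d.get(i, 0)  # .get() returns the fallback without inserting
    return d


grown, d1 = traced_growth(lookup_with_defaultdict, 100000)
flat, d2 = traced_growth(lookup_with_get, 100000)
print(len(d1), grown)  # 100000 entries, memory held by the dictionary
print(len(d2), flat)   # 0 entries, almost nothing retained
```

In both cases the memory is reachable through the dictionary, so it is in use rather than leaked.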
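Iterating over freqDic does not generate new values; it hands you references to the values that already exist in the dictionary. This means that anything added to fv is added to the object stored in freqDic, and it is still there after the loop has finished.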
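A small sketch of this point, written for Python 3 (so .items() instead of .iteritems()); the dictionary contents and the name freq_dic are made up for illustration:

```python
from collections import defaultdict

# Hypothetical stand-in for the question's dictionary of frequency vectors.
freq_dic = {"doc1": defaultdict(int), "doc2": defaultdict(int)}

for fid, fv in freq_dic.items():
    assert fv is freq_dic[fid]  # fv is the stored object itself, not a copy
    _ = fv["missing-term"]      # the defaultdict lookup inserts the key

# The loop is over, but the inserted keys are still held by freq_dic.
sizes = {fid: len(fv) for fid, fv in freq_dic.items()}
print(sizes)  # {'doc1': 1, 'doc2': 1}
```

Rebinding the name fv on the next iteration only drops one reference; the outer dictionary still refers to each vector, so gc.collect() has nothing unreachable to reclaim.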
Another fix is to clear freqDic once you have finished iterating over it.
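In general, Python passes everything by reference, even though it sometimes does not look that way. Strings and integers are immutable; when they are "changed", the object representing them is replaced.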
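As a rough illustration of the difference between mutating an object and rebinding a name (Python 3, with hypothetical function names add_entry and increment):

```python
def add_entry(d):
    d["new"] = 1  # mutates the dict object the caller also refers to


def increment(n):
    n += 1  # ints are immutable: this rebinds the local name only
    return n


shared = {}
add_entry(shared)

count = 0
result = increment(count)

print(shared)  # {'new': 1}
print(count)   # 0
print(result)  # 1
```

The dictionary changes because both names refer to the same mutable object, while the integer appears unchanged because a new object was bound to the local name.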