Why does this Python method leak memory?
The method iterates over a list of terms stored in the database and checks whether each term appears in the text passed in as an argument. When a term is found, it is replaced with a link to the search page, with the term passed along as a parameter.
The number of terms is huge (about 100,000), so the process is quite slow, but that's fine because it runs as a cron job. However, the script's memory consumption skyrockets and I can't find out why:
import re

from django.db import models


class SearchedTerm(models.Model):

    [...]

    @classmethod
    def add_search_links_to_text(cls, string, count=3, queryset=None):
        """
        Take the list of all searched terms and look for them in the
        given text. If they are found, turn them into links to the
        search page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites have different URL schemas, we don't
        provide direct links; we inject the {% url %} tag instead, so it
        must be rendered before display. You can use the `eval` tag
        from `libs` for this. Since the sites have different namespaces
        as well, we insert a generic 'namespace' and delegate to the
        template to replace it with the proper one.

        If you have a batch process to run, you can pass a queryset
        that will be used instead of fetching all searched terms on
        each call.
        """
        found = 0
        terms = queryset or cls.on_site.all()

        # Keep a set of already linkified content so duplicate searched
        # terms are not replaced twice; seed it with the words we are
        # going to insert with the link, so they won't match on later
        # passes.
        processed = set((u'video', u'streaming', u'title',
                         u'search', u'namespace', u'href',
                         u'url'))

        for term in terms:
            text = term.text.lower()

            # Skip small words, and do a quick substring check to avoid
            # all the rest of the matching.
            if len(text) < 3 or text not in string:
                continue

            if found and cls._is_processed(text, processed):
                continue

            # Match the searched word, with accents and in any case;
            # ensure it is not part of a larger word by requiring a
            # 'non-letter' character (or the string boundary) on both
            # ends of the word.
            pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text,
                                 re.UNICODE | re.IGNORECASE)

            if re.search(pattern, string):
                found += 1

                # Create the link string and replace the word in the
                # description. Use back references (\1, \2, etc.) to
                # preserve the original formatting, and raw unicode
                # strings (ur"string" notation) to avoid problems with
                # accents and escaping.
                query = '-'.join(term.text.split())
                url = ur'{%% url namespace:static-search "%s" %%}' % query
                replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

                string = re.sub(pattern, replace_with, string)
                processed.add(text)

                if found >= count:
                    break

        return string
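To make the replacement technique easier to follow, here is a minimal standalone sketch of the same back-reference trick on a plain string (the term, URL and sample text are illustrative only, not part of the model):

import re

# Standalone illustration: match a term bounded by non-word characters
# (or string boundaries) and wrap group 2 in a link, keeping groups 1
# and 3 (the surrounding characters) untouched.
term = u'streaming'
pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % term,
                     re.UNICODE | re.IGNORECASE)
replace_with = ur'\1<a href="/search/%s/">\2</a>\3' % term
print(pattern.sub(replace_with, u'Watch free Streaming here.'))
# -> Watch free <a href="/search/streaming/">Streaming</a> here.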
You might also want this code:
class SearchedTerm(models.Model):

    [...]

    @classmethod
    def _is_processed(cls, text, processed):
        """
        Check if the text is part of one of the already processed
        strings. We don't just test membership with `in` on the set; we
        also test `in` against each string of the set, so that a term
        contained in already-inserted content is not replaced again
        (which would destroy the tags).

        This is mainly a utility function, so you probably won't use it
        directly.
        """
        if text in processed:
            return True
        return any((text in string) for string in processed)
I can only think of two objects that could be suspects here: `terms` and `processed`. But I can't see any reason why they wouldn't be garbage collected.
EDIT:
I guess I should mention that this method is called from inside a Django model method. I don't know whether that's relevant, but here is the code:
class Video(models.Model):

    [...]

    def update_html_description(self, links=3, queryset=None):
        """
        Take the list of all searched terms and look for them in the
        description. If they are found, turn them into links to the
        search engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and therefore has the same
        limitations.

        It DOESN'T call save().
        """
        queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
        text = self.description or self.title
        self.html_description = SearchedTerm.add_search_links_to_text(text,
                                                                      links,
                                                                      queryset)
I can imagine that Python's automatic regex caching eats up some memory. But it should only do so once, whereas the memory consumption grows with every single call to `update_html_description`.
The problem is not just that it consumes a lot of memory; the bigger problem is that it never releases it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with a "cannot allocate memory" error.
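One way to test the regex-cache hypothesis would be to clear the `re` module's internal pattern cache with `re.purge()` and force a garbage collection between calls while watching resident memory; if the usage still climbs, the cache is not to blame. A rough sketch (the `resource` module is Unix-only, and the loop over `Video` is only illustrative):

import gc
import re
import resource

def peak_rss_kb():
    # Peak resident set size of the current process (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for video in Video.objects.all().iterator():
    video.update_html_description()
    re.purge()    # drop the re module's internal cache of compiled patterns
    gc.collect()  # collect anything that is no longer reachable
    print(peak_rss_kb())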
4 Answers
1
Make sure you aren't running in DEBUG mode.
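The reason this matters: with `DEBUG = True`, Django records every SQL query it executes in `django.db.connection.queries`, which grows without bound in a long-running process such as a cron job. A quick sketch of how to check it and, if DEBUG has to stay on, clear it:

from django.conf import settings
from django.db import connection, reset_queries

# With DEBUG = True, every executed SQL query is appended to
# connection.queries and kept for the life of the process.
print(settings.DEBUG, len(connection.queries))

# If DEBUG must remain enabled, clear the stored queries periodically:
reset_queries()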
2
I couldn't find the cause of the problem at all, but for now I'm working around it by extracting the infamous snippet into its own script and calling it with `subprocess`. The memory usage still goes up, but of course it goes back to normal once that Python process dies.
Talk about dirty.
But that's all I've got for now.
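The workaround looks roughly like this (a sketch only; `update_descriptions.py` and the pk argument are made-up names, not the actual script):

import subprocess
import sys

# Run the leaky snippet in its own process; when that process exits,
# the OS reclaims all of its memory.
subprocess.check_call([sys.executable, 'update_descriptions.py',
                       str(video.pk)])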
3
When you evaluate a whole queryset, all of its data is loaded into memory at once, which can use up a lot of it. If the result set is that large, you should fetch the results in smaller batches; it may mean more hits on the database, but it will also mean much less memory consumption.
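One way to do that with a Django queryset is to walk it in fixed-size, primary-key-ordered chunks instead of evaluating it all at once. A sketch (the helper name and batch size are arbitrary):

def iter_in_batches(queryset, batch_size=1000):
    # Yield rows in primary-key order, batch_size at a time, so the
    # whole result set is never held in memory at once.
    start = 0
    while True:
        batch = list(queryset.order_by('pk')[start:start + batch_size])
        if not batch:
            break
        for obj in batch:
            yield obj
        start += batch_size

for term in iter_in_batches(SearchedTerm.on_site.all()):
    pass  # process one term at a time

The queryset's `.iterator()` method also avoids caching the rows on the queryset object itself, although depending on the database driver the full result set may still be buffered client-side.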