Can I increase the memory a Python process is allowed to use?

Asked 2025-04-16 19:34

I'm running classification and feature-extraction tasks on a Windows server with 64 GB of RAM, yet Python somehow thinks I'm running out of memory:

misiti@fff /cygdrive/c/NaiveBayes
$ python run_classify_comments.py > tenfoldcrossvalidation.txt
Traceback (most recent call last):
  File "run_classify_comments.py", line 70, in <module>
    run_classify_comments()
  File "run_classify_comments.py", line 51, in run_classify_comments
    NWORDS = get_all_words("./data/HUGETEXTFILE.txt")
  File "run_classify_comments.py", line 16, in get_all_words
    def get_all_words(path): return words(file(path).read())
  File "run_classify_comments.py", line 15, in words
    def words(text): return re.findall('[a-z]+', text.lower())
  File "C:\Program Files (x86)\Python26\lib\re.py", line 175, in findall
    return _compile(pattern, flags).findall(string)
MemoryError

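So the re module is crashing on a machine with 64 GB of RAM... That doesn't seem possible. Why is this happening, and how can I configure Python so that it can use all the memory on my machine?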
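One thing worth checking first: the traceback shows the interpreter living under C:\Program Files (x86)\Python26, which is where 64-bit Windows normally installs 32-bit programs. If this Python is a 32-bit build, the process can only address a few gigabytes regardless of how much RAM the server has. The sketch below is one way to check; the function name interpreter_bits is made up for illustration, and it uses only the standard library.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("This Python is a %d-bit build" % interpreter_bits())
    # sys.maxsize exceeds 2**32 only on 64-bit builds.
    print("64-bit address space: %s" % (sys.maxsize > 2 ** 32))
```

If it reports 32, installing a 64-bit Python build is likely the only way to let a single process grow past that limit. Reading the file incrementally avoids the problem either way.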

Tags: regex, memory-management, feature-extraction, windows-server, process-optimization

2 Answers

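I think the problem is that re.findall() reads the entire text into memory as a single list of words. Are you reading more than 64 GB of text in one go? Depending on how you implement your NaiveBayes algorithm, you would probably do better to build your frequency dictionary incrementally, so that only the dictionary is kept in memory rather than the whole text. If you can share more details about your implementation, it would be easier to address your problem directly.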
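A minimal sketch of that incremental approach, assuming Python 3 and collections.Counter (the question uses Python 2.6, where Counter is not available, so treat this as an illustration rather than a drop-in patch). The names count_words and WORD_RE are hypothetical; the regular expression is the one from the question.

```python
import re
from collections import Counter

# Same pattern as in the question: runs of lowercase ASCII letters.
WORD_RE = re.compile(r"[a-z]+")


def count_words(lines):
    """Build a word-frequency table from an iterable of lines.

    Only one line and the running counts are held in memory at a time.
    """
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


if __name__ == "__main__":
    sample = ["The cat sat\n", "the CAT ran\n"]
    print(count_words(sample).most_common(2))
```

A file object is itself an iterable of lines, so for a real file this would be called as `with open(path) as handle: counts = count_words(handle)`. Memory use then grows with the number of distinct words rather than with the size of the file.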

Answered 2025-04-16 by Python大师

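Just change your program so that it reads the big text file one line at a time. This is easy: change get_all_words(path) to:

def get_all_words(path):
    # Iterate over the file line by line instead of calling read() on all of it.
    # sum() needs [] as its start value to concatenate the per-line lists.
    return sum((words(line) for line in open(path)), [])

Note the generator expression inside the parentheses: it is lazy, so each line is only processed when sum() asks for it. The raw text is therefore never held in memory all at once, although the returned list still contains every word in the file.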
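If the caller can consume words one at a time, the list can be avoided altogether. The following is a sketch under that assumption, written for Python 3; iter_words and WORD_RE are hypothetical names, and re.finditer is swapped in for re.findall so that no per-line list is built either.

```python
import re

# Same pattern as in the question: runs of lowercase ASCII letters.
WORD_RE = re.compile(r"[a-z]+")


def iter_words(lines):
    """Lazily yield lowercase words from an iterable of lines."""
    for line in lines:
        for match in WORD_RE.finditer(line.lower()):
            yield match.group()


if __name__ == "__main__":
    for word in iter_words(["Foo bar\n", "BAZ 42 qux"]):
        print(word)
```

For a file this would be used as `with open(path) as handle: for word in iter_words(handle): ...`. Nothing is read until the loop asks for the next word, so memory use stays roughly constant however large the file is.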

Answered 2025-04-16 by Python大师
