Python os.walk memory issue

7 votes

3 answers

1727 views

数据工程师

Asked 2025-04-18 11:28

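I wrote a scanner that searches every hard drive on a system for specific files. Some of these systems are quite old: they run Windows 2000 with only 256 or 512 MB of RAM, yet their file system structure is complex because some of them also act as file servers.

In my script I use os.walk() to traverse all directories and files.

Unfortunately, we noticed that the scanner consumes a lot of memory once it has been running for a while. After 2 hours of scanning, os.walk alone was using about 50 MB of memory, and the usage keeps growing over time. After 4 hours of scanning it had reached 90 MB.

Is there a way to avoid this? We also tried "betterwalk.walk()" and "scandir.walk()", with the same result. Do we need to write our own walk function that periodically removes already-scanned directory and file objects from memory so that the garbage collector can clean them up?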

[Figure: resource usage over time; the second row is memory]

Thanks

Tags: code optimization, file system, memory management, garbage collection, directory traversal, resource consumption, Windows 2000, scanner

3 Answers

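Generators are a better solution because they evaluate lazily: each value is computed only when it is actually requested. Here is an example implementation.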

import os
import fnmatch

# Optional helper: lazily yield the full path of each entry directly under `path`.
def list_dir(path):
    for name in os.listdir(path):
        yield os.path.join(path, name)

# Lazily yield files under `top` whose names match `pattern`.
def os_walker(top, pattern='*.py'):
    for root, dlist, flist in os.walk(top):
        for name in fnmatch.filter(flist, pattern):
            yield os.path.join(root, name)

all_dirs = list_dir("D:\\tuts\\pycharm")

# os.walk yields nothing for a plain file, so only subdirectories are searched here.
for entry in all_dirs:
    for name in os_walker(entry):
        print(name)

Credit to David Beazley.
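On the question of whether a hand-written walk function is needed, here is a minimal sketch of one built on a generator and an explicit stack. It assumes Python 3.6+ (for os.scandir used as a context manager), which may not be available on the older machines mentioned in the question. The name iter_files and its pattern parameter are made up for illustration, and whether this lowers memory use in the asker's setup is untested.

```python
import fnmatch
import os


def iter_files(top, pattern="*.py"):
    """Yield matching file paths under `top`, one directory at a time."""
    pending = [top]  # directories still waiting to be visited
    while pending:
        current = pending.pop()
        try:
            # The scandir iterator is closed as soon as the directory is done.
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        pending.append(entry.path)
                    elif fnmatch.fnmatch(entry.name, pattern):
                        yield entry.path
        except OSError:
            # Unreadable directory or entry: skip the rest of it.
            continue
```

Only the list of not-yet-visited directories is held between steps, so the caller can process each path as it arrives, for example with `for path in iter_files(r"D:\tuts\pycharm"): print(path)`.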

Answered 2025-04-18 by Python大师


If you are looping over os.walk, remember to delete anything you no longer need with del. Then, at the end of each os.walk iteration, try running gc.collect().
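A rough sketch of that suggestion is below. The function name scan_with_gc and the collect_every parameter are hypothetical, added only so that a full collection does not run on every single directory. Note that del only drops this function's own references; whatever os.walk keeps internally is unaffected, so whether this helps depends on where the memory is actually going.

```python
import fnmatch
import gc
import os


def scan_with_gc(top, pattern="*.py", collect_every=1000):
    """Yield paths under `top` whose file name matches `pattern`.

    `collect_every` is a hypothetical tuning knob: a garbage-collection
    pass is forced after that many directories have been visited.
    """
    for count, (root, dirs, files) in enumerate(os.walk(top), start=1):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)
        # Drop our references to this directory's listings before moving on.
        del dirs, files
        if count % collect_every == 0:
            gc.collect()
```

Setting collect_every=1 matches the advice literally (one gc.collect() per iteration), at the cost of a slower scan.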

Answered 2025-04-18 by Python大师


Have you tried the glob module?

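import os, glob

# Recursively print every entry under search_dir.
def globit(search_dir):
    pattern = os.path.join(search_dir, "*")
    for path in glob.glob(pattern):
        print(path)
        # For a plain file, "path/*" matches nothing, so the recursion stops there.
        globit(path)

if __name__ == '__main__':
    start_dir = r'C:\working'
    globit(start_dir)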
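One caveat: glob.glob builds a complete list for each directory before returning. A lazier variant, sketched here under the assumption that Python 3.5+ is available (for recursive=True), uses glob.iglob, which returns an iterator instead. The name iter_matches is made up for illustration.

```python
import glob
import os


def iter_matches(search_dir, pattern="*"):
    """Lazily yield every path under `search_dir` whose name matches `pattern`."""
    # glob.escape keeps characters such as "[" in the directory name literal.
    base = glob.escape(search_dir)
    # "**" with recursive=True matches zero or more levels of subdirectories.
    return glob.iglob(os.path.join(base, "**", pattern), recursive=True)
```

It can be consumed one path at a time, e.g. `for path in iter_matches(r'C:\working', '*.py'): print(path)`. Like glob.glob, it skips names that start with a dot.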

Answered 2025-04-18 by Python大师

