Python memory error when sequentially hashing a large number of files

Asked 2025-04-18 11:31

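Basically, for practice and fun, I'm building a list of all my files, generating a hash for each one, and then looking for duplicate files in that list. It's just a personal project, and I don't care how well it performs.

I keep running into the following error: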

python(16563) malloc: *** mmap(size=140400698290176) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

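The script reads a 22MB text file, and many of the files I will be hashing are videos, usually but not always smaller than 4GB. I'm not much of a programmer; I usually design user interfaces and user experiences, so I don't know much about memory, hardware, or computation.

I'm on a Mac (10.8.5) with 32GB of RAM. Python (2.7) is running in 64-bit mode. I'm scanning 7 volumes; my boot disk is the smallest, a 90GB SSD.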

import hashlib
import os

# This class mostly just calls os.stat and hashes the file
class FileInspection:
    def __init__(self, path):
        self.path = path
        self.hash = self.double_hash(path)
        self.inspectable = True
        stats = os.stat(path)
        self.size = stats.st_size
        self.last_mod = stats.st_mtime

    def double_hash(self, path):
        checksum = None
        matching_checksums = False
        match_attempts = 0
        while not matching_checksums:
            match_attempts += 1
            fileData = open(self.path).read()
            checksum1 = hashlib.md5(fileData).hexdigest()
            checksum2 = hashlib.md5(fileData).hexdigest()
            if checksum1 == checksum2:
                checksum = checksum1
                matching_checksums = True
            elif match_attempts > 3:
                self.hash = False
                self.inspectable = False
        return checksum


# This is the main function call
def masterList(self, path):
        f = open(path, "r")
        lines = f.readlines()
        f.close()

        f = open(path, "w")
        for line in lines:
            line = line[:-1]
            fileInfo = FileInspection(line)
            fileStr = 'f_{0} = {1}"checked":False, "path":"{2}", "inspectable":{3}, "hash":, "filesize":{4}, "lastmod":"{5}"{6}'.format(fileInfo.hash, "{", fileInfo.path, fileInfo.inspectable, fileInfo.hash, fileInfo.last_mod, "}")
            f.write(fileStr)
        f.close()

masterList("/path/to/a/giant/list/of/files.txt")


1 Answer

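You may be running a 32-bit Python while trying to load more than 4GB of data into the process. You can try running the code under a 64-bit Python, or adjust the md5 code in your double_hash function: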

fileData = open(self.path).read()
checksum1 = hashlib.md5(fileData).hexdigest()
checksum2 = hashlib.md5(fileData).hexdigest() # Why calculate this twice?

Change it to this:

read_size = 1024 # You can make this bigger
checksum1 = hashlib.md5()
with open(self.path, 'rb') as f:
    data = f.read(read_size)
    while data:
        checksum1.update(data)
        data = f.read(read_size)
checksum1 = checksum1.hexdigest()
#continue using checksum1

This way the md5 is computed without loading the whole file into memory.
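
For reference, here is a minimal sketch of the same chunked approach wrapped in a reusable function, plus a quick check of whether the interpreter is a 64-bit build. The names `md5_of_file`, `is_64bit_python`, and the 1 MiB default `chunk_size` are illustrative assumptions, not part of the original code; it should run on both Python 2.7 and Python 3.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks.

    `chunk_size` (hypothetical default: 1 MiB) is the most data read at once.
    """
    digest = hashlib.md5()
    # Binary mode so the bytes are hashed exactly as stored on disk.
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b"", which signals end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_64bit_python():
    """Report whether this interpreter is a 64-bit build."""
    return sys.maxsize > 2 ** 32
```

Calling `md5_of_file(self.path)` from `double_hash` in place of `open(self.path).read()` should keep roughly one chunk in memory at a time, regardless of how large the video files are.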

Answered 2025-04-18 by Python大师
