Quickly finding the differences between two large text files

10 votes

5 answers

19042 views

数据工程师

Asked 2025-04-16 03:09

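I have two 3 GB text files, each with about 80 million lines. The two files are 99.9% identical (file A has 60,000 unique lines, file B has 80,000 unique lines).

How can I quickly find the lines that are unique to each file? Is there a ready-made command-line tool for this? I'm using Python, but I suspect that loading and comparing the files in Python would be inefficient.

Any suggestions are welcome.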

big data processing, text processing, command-line tools, data analysis, text comparison, file differences

5 Answers

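If you have only 60,000 to 80,000 unique lines, you can build a dictionary that maps each distinct line to a number, for example mydict["hello world"] => 1. Assuming an average line length of 40 to 80 characters, the dictionary takes roughly 10 MB of memory.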

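Next, read each file and use the dictionary to convert it into an array of numbers. These arrays are much smaller than the original text and easier to hold in memory (at 8 bytes per entry, 80 million lines come to about 640 MB per file, compared with 3 GB of raw text). Then compare the two lists. You can also invert the dictionary and use it to print the text of the lines that differ.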
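
The steps described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names encode and unique_lines are made up here, the comparison is done with a set difference of IDs, and plain Python lists stand in for the compact number arrays the answer has in mind.

```python
def encode(lines, ids):
    """Replace each line with an integer ID, extending `ids` for unseen lines."""
    # setdefault returns the existing ID, or stores and returns the next free one
    return [ids.setdefault(line, len(ids) + 1) for line in lines]


def unique_lines(lines_a, lines_b):
    """Return (lines only in A, lines only in B) using one shared dictionary."""
    ids = {}
    a = encode(lines_a, ids)
    b = encode(lines_b, ids)
    # inverted dictionary: ID -> original text, used to print the differences
    text = {number: line for line, number in ids.items()}
    set_a, set_b = set(a), set(b)
    only_a = [text[n] for n in sorted(set_a - set_b)]
    only_b = [text[n] for n in sorted(set_b - set_a)]
    return only_a, only_b


if __name__ == '__main__':
    only_a, only_b = unique_lines(["x", "y", "z"], ["y", "z", "w"])
    print(only_a, only_b)
```

With real files, open file objects can be passed in place of the lists, since encode only iterates over its input.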

Edit:

Based on your comment, here is a sample script that assigns numbers to unique lines as it reads the file.

#!/usr/bin/env python3

import sys


class Reader:
    """Maps each distinct line of a file to a small integer ID."""

    def __init__(self, file):
        self.count = 0   # last ID handed out
        self.dict = {}   # line text -> ID
        self.file = file

    def readline(self):
        line = self.file.readline()
        if not line:
            return None  # end of file
        if line not in self.dict:
            # first time this line is seen: assign the next ID
            self.count += 1
            self.dict[line] = self.count
        return self.dict[line]


if __name__ == '__main__':
    print("Type Ctrl-D to quit.")
    r = Reader(sys.stdin)
    while True:
        result = r.readline()
        if result is None:
            break
        print(result)

Answered 2025-04-16 by Python大师


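I think this is the fastest approach (whether in Python or another language should not matter much).

Notes:

1. I store only the hash of each line, which saves space (and time, if the system would otherwise start paging).

2. Because of the above, I print only line numbers; if you need the actual content, you have to read the files again.

3. I assume the hash function produces no collisions. That is almost certain, but not guaranteed.

4. I import hashlib because the built-in hash() function is too short and prone to collisions.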
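
To put notes 3 and 4 in numbers, the birthday approximation gives a rough collision probability for a given number of hashed lines and hash width. The sketch below is an estimate only: the function name collision_probability is made up for illustration, and the 64-bit width assumed for the built-in hash() is what typical 64-bit CPython builds use.

```python
import math


def collision_probability(n_items, hash_bits):
    """Birthday-bound estimate of at least one collision among n_items hashes."""
    pairs_over_space = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-pairs_over_space)


if __name__ == '__main__':
    total_lines = 160_000_000  # about 80 million lines in each of the two files
    print(collision_probability(total_lines, 64))   # assumed width of hash()
    print(collision_probability(total_lines, 512))  # SHA-512
```

For 160 million lines the 64-bit estimate comes out a little below one in a thousand, which is small but not negligible, while the 512-bit estimate is vanishingly small.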

import sys
import hashlib

lines = []
# the two files are named on the command line
for path in sys.argv[1:3]:
    # maps the hash of each line to "filename: line number"
    seen = {}
    # read as bytes so that no particular text encoding is assumed
    with open(path, 'rb') as f:
        # line numbers start at 1
        for counter, line in enumerate(f, 1):
            hashcode = hashlib.sha512(line).hexdigest()
            seen[hashcode] = path + ': ' + str(counter)
    lines.append(seen)
# hashes that occur in only one of the two files
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
print('\n'.join(result))
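
Note 2 says the files have to be read again to recover the actual text. A minimal sketch of that second pass might look like this; fetch_lines is a hypothetical helper, not part of the answer, and it assumes the line numbers are 1-based as in the script above.

```python
def fetch_lines(path, wanted):
    """Return {line_number: raw line} for the 1-based line numbers in `wanted`."""
    wanted = set(wanted)
    found = {}
    with open(path, 'rb') as f:
        for number, line in enumerate(f, 1):
            if number in wanted:
                found[number] = line
                # stop early once every requested line has been seen
                if len(found) == len(wanted):
                    break
    return found
```

The file is streamed once and only the requested lines are kept, so memory use depends on the number of differing lines rather than on the file size.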

Answered 2025-04-16 by Python大师


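If order matters, try the comm utility (it expects both inputs to be sorted). If order does not matter, use sort file1 file2 | uniq -u. Note that uniq -u also drops any line that is repeated within a single file.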
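
For readers who want to stay in Python, the idea behind comparing two sorted inputs can be sketched as a single merge pass. This is an illustration of the technique, not comm's actual implementation: comm_unique is a made-up name, and the sketch assumes both inputs are already sorted in the same order Python uses to compare strings.

```python
def comm_unique(sorted_a, sorted_b):
    """Yield ('A', line) or ('B', line) for lines present in only one input."""
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            # common line: skip it on both sides
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            yield ('A', a)
            a = next(it_a, None)
        else:
            yield ('B', b)
            b = next(it_b, None)
    # whatever remains on either side has no counterpart
    while a is not None:
        yield ('A', a)
        a = next(it_a, None)
    while b is not None:
        yield ('B', b)
        b = next(it_b, None)


if __name__ == '__main__':
    for side, line in comm_unique(["a", "b", "d"], ["b", "c", "d", "e"]):
        print(side, line)
```

Because only one line from each input is held at a time, memory use stays constant regardless of file size; the cost is that both files must be sorted first.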

Answered 2025-04-16 by Python大师

