<p>Here is a Python-and-Excel approach that will run on your current machine. (I am assuming that buying more RAM, running in the cloud, or using a database is not an option.)</p>
<p>First, create two sample dataframes for illustration. (If necessary, you can use Excel itself to convert the .xlsx files to .csv, then read each .csv file in chunks.)</p>
<pre><code>import pandas as pd
df1 = pd.DataFrame({'a': [11, 12, 13], 'b': [14, 15, 16], 'c': [17, 18, 19]})
df2 = pd.DataFrame({'a': [13, 14, 15], 'b': [16, 17, 18], 'c': [19, 20, 21]})
print(pd.concat([df1, df2]))
a b c
0 11 14 17
1 12 15 18
2 13 16 19 # copy 1
0 13 16 19 # copy 2 (duplicate)
1 14 17 20
2 15 18 21
</code></pre>
<p>Now iterate over each row of each dataframe and compute a hash of the row. If you have seen that hash before, you have already seen the row itself, so skip it. If the hash is new, export the row and add the hash to the set.</p>
<pre><code>cache = set()
for d in [df1, df2]:
for row in d.itertuples():
h = hash(row[1:]) # start at one to _exclude_ the index
if h in cache:
continue
print(row, h)
cache.add(h)
# you'll get this output, printed for illustration.
# it's easy to parse, for a variety of downstream uses
# note that the duplicate row was NOT exported
Pandas(Index=0, a=11, b=14, c=17) -8507403049323048181
Pandas(Index=1, a=12, b=15, c=18) -2181970144325339791
Pandas(Index=2, a=13, b=16, c=19) -6414432197316917400
Pandas(Index=1, a=14, b=17, c=20) -7535143174237721170
Pandas(Index=2, a=15, b=18, c=21) 4031466940829764140
</code></pre>