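How to remove duplicate lines from a text file, along with the original line each duplicate repeats

2 answers

User

Answer 1 · edited 2024-04-18 20:47:02

Based on your input, you can do something like this: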

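seen = {}  # maps each key to the first line that carries it
double_seen = set()  # keys that occur more than once

with open('input.txt') as f:
    for line in f:
        if ':' not in line:
            continue  # assumption: lines without a colon carry no key and are skipped
        _, key = line.split(':', 1)  # split on the first colon only
        key = key.strip()
        if key not in seen:  # Have not seen this key yet?
            seen[key] = line  # Then remember its line
        else:
            double_seen.add(key)  # Else we have seen this key more than once

# Now write the surviving lines to a different file, in their original order
with open('output.txt', 'w') as f2:
    for key, line in seen.items():
        if key not in double_seen:
            f2.write(line)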

The input I used:

line 1 : Messi
line 2 : Messi
line 3 : CR7

Output:

line 3 : CR7

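Note: this solution assumes Python 3.7+, where dictionaries are guaranteed to preserve insertion order. Iterate over the dictionary itself when writing the output, since a set does not keep that order.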
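
As a quick way to check the logic without touching any files, the same approach can be wrapped in a function that takes any iterable of lines. This is only a minimal sketch: the name `keep_unique_keys` is hypothetical, and it assumes the key is the text after the first colon, as in the sample input above.

```python
def keep_unique_keys(lines):
    """Return the lines whose key (the text after the first ':') occurs exactly once."""
    seen = {}            # key -> first line carrying that key
    double_seen = set()  # keys that occurred more than once

    for line in lines:
        if ':' not in line:
            continue  # assumption: lines without a colon carry no key and are skipped
        _, key = line.split(':', 1)
        key = key.strip()
        if key in seen:
            double_seen.add(key)
        else:
            seen[key] = line

    # Iterating over the dict keeps the original order (Python 3.7+).
    return [line for key, line in seen.items() if key not in double_seen]


sample = ["line 1 : Messi\n", "line 2 : Messi\n", "line 3 : CR7\n"]
print(keep_unique_keys(sample))  # ['line 3 : CR7\n']
```

Passing an open file object as `lines` should work the same way, since a file iterates line by line.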

User

Answer 2 · edited 2024-04-18 20:47:02

Have you tried Counter? For example:

import collections

a = [1, 1, 2]

out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)

Output: [2]

Or, for a longer example:

import collections

a = [1, 1, 1, 2, 4, 4, 4, 5, 3]

out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)

Output: [2, 5, 3]
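
To see why the filter works, it can help to look at the counts themselves. Here is a small sketch using the same list. `Counter` is a `dict` subclass, so on Python 3.7+ its items come back in first-seen order, which is why the result reads [2, 5, 3] rather than being sorted. The variable names `counts` and `once` are just illustrative.

```python
import collections

a = [1, 1, 1, 2, 4, 4, 4, 5, 3]
counts = collections.Counter(a)

# Each distinct value maps to how many times it occurs.
print(dict(counts))  # {1: 3, 2: 1, 4: 3, 5: 1, 3: 1}

# Keep only the values that occur exactly once, in first-seen order.
once = [value for value, count in counts.items() if count == 1]
print(once)  # [2, 5, 3]
```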

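Edit:

Since you do not start with a list, there are two approaches depending on the file size: the first is for files small enough to fit in memory (otherwise you may run into memory problems), and the second is for larger files.

Read the file into a list and apply the Counter approach above: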

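import collections

# infilename and outfilename are the paths of the input and output files
with open(infilename) as infile:
    lines = infile.readlines()  # reads the whole file into memory
out = [k for k, v in collections.Counter(lines).items() if v == 1]
with open(outfilename, 'w') as outfile:
    outfile.writelines(out)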

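The code above reads the whole file into a list, so a really large file ends up entirely in memory. If you need to handle large files, you can instead use a kind of "blacklist", which keeps each distinct line in memory only once rather than every line of the file:

Using a blacklist: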

lines_seen = set()  # lines seen exactly once so far
blacklist = set()  # lines seen more than once
with open(infilename, "r") as infile:
    for line in infile:
        if line not in lines_seen and line not in blacklist:  # not a duplicate
            lines_seen.add(line)
        else:
            lines_seen.discard(line)
            blacklist.add(line)
# Note: a set is unordered, so the output order may differ from the input order
with open(outfilename, "w") as outfile:
    for line in lines_seen:
        outfile.write(line)

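Here, every line that has been seen exactly once so far is kept in the set, and the set is written to the file only at the end. The blacklist remembers every line that has occurred more than once, so such a line is never written, not even once. You cannot write the output while you are still reading, because at that point you do not know whether the same line will appear again later. If you know more about the data (for example, that repeated lines always appear consecutively), you can do it differently.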
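
For the special case just mentioned, where all copies of a repeated line sit next to each other, a single streaming pass is enough. The following is a sketch under that assumption using `itertools.groupby`; the function name `unrepeated_runs` is hypothetical. It looks at one run of equal lines at a time, and it would give wrong results if copies of a line can be separated by other lines.

```python
import itertools


def unrepeated_runs(lines):
    """Yield each line that is not immediately repeated.

    Assumption: all copies of a repeated line are adjacent in the input.
    """
    for _, run in itertools.groupby(lines):
        first = next(run)            # a run always has at least one item
        if next(run, None) is None:  # no second item: the line occurred once
            yield first


sample = ["a\n", "a\n", "b\n", "c\n", "c\n", "c\n", "d\n"]
print(list(unrepeated_runs(sample)))  # ['b\n', 'd\n']

# With files (infilename and outfilename are placeholder paths):
# with open(infilename) as infile, open(outfilename, "w") as outfile:
#     outfile.writelines(unrepeated_runs(infile))
```

Because the generator yields lines in input order, the output order matches the input, which the set-based version does not guarantee.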

Edit 2

If you want to decide based on the first part of each line:

first_to_line = {}  # first field -> the only line seen with it so far
blacklist = set()  # first fields seen more than once
with open(infilename, "r") as infile:
    for line in infile:
        first = line.split(',')[0]
        if first not in first_to_line and first not in blacklist:  # not a duplicate
            first_to_line[first] = line
        else:
            # Drop the earlier line that shares this first field, if any
            first_to_line.pop(first, None)
            blacklist.add(first)
print(len(first_to_line))  # number of lines that will be written
with open(outfilename, "w") as outfile:
    for line in first_to_line.values():
        outfile.write(line)

Note: so far I have only kept adding to the code; there is probably a better way.

For example, using a dict:

lines_dict = {}  # first field -> all lines that start with it
with open(infilename, 'r') as infile:
    for line in infile:
        first = line.split(',')[0]
        lines_dict.setdefault(first, []).append(line)
with open(outfilename, 'w') as outfile:
    for key, value in lines_dict.items():
        if len(value) == 1:  # the first field occurred only once
            outfile.write(value[0])
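
One possible variation, which is not part of the answer above, is to read the file twice: once to count how often each first field occurs, and once to copy the lines whose first field occurred exactly once. This sketch keeps only the first fields and their counts in memory instead of the lines themselves, and it preserves the input order. The function name `write_lines_with_unique_first_field` and the `sep` parameter are hypothetical; the comma separator is taken from the code above.

```python
import collections


def write_lines_with_unique_first_field(infilename, outfilename, sep=','):
    """Copy the lines whose first field occurs only once in the input file."""
    # First pass: count how often each first field occurs.
    with open(infilename) as infile:
        counts = collections.Counter(line.split(sep)[0] for line in infile)

    # Second pass: copy only the lines whose first field was seen once.
    with open(infilename) as infile, open(outfilename, 'w') as outfile:
        for line in infile:
            if counts[line.split(sep)[0]] == 1:
                outfile.write(line)
```

The trade-off is a second read of the input, which may be acceptable when the lines are long and the first fields are short.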
