How to remove duplicate lines from a text file, along with the unique lines related to those duplicates

Posted 2024-04-18 20:47:02


How can I remove duplicate lines from a file, along with the unique lines related to those duplicates?

Example:

Input file:

    line 1 : Messi , 1 
    line 2 : Messi , 2
    line 3 : CR7 , 2

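I want the output file to be:

line 1 : CR7 , 2

Just "CR7 , 2": I want to delete the duplicated lines as well as the unique lines related to those duplicates.

The deletion depends on the first field: if the first field of a line matches that of another line, I want that line deleted.

How can I do this in Python? What should I change in this code: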

lines_seen = set() # holds lines already seen
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen: # not a duplicate
        outfile.write(line)
        lines_seen.add(line)
outfile.close()

What is the best way to do this?


2 Answers

Given your input, you could do the following:

seen = {}  # maps each key to the first line that carried it
double_seen = set()  # keys that occurred more than once

with open('input.txt') as f:
    for line in f:
        _, key = line.split(':', 1)  # split only at the first colon
        key = key.strip()
        if key not in seen:  # first occurrence of this key
            seen[key] = line
        else:
            double_seen.add(key)  # this key has now been seen more than once

# Now write the surviving lines to a different file, in input order
with open('output.txt', 'w') as f2:
    for key, line in seen.items():
        if key not in double_seen:
            f2.write(line)

The input I used:

line 1 : Messi
line 2 : Messi
line 3 : CR7

Output:

line 3 : CR7

Note: this solution assumes Python 3.7+, because it relies on dictionaries preserving insertion order.
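The sample input above has only a name after the colon, while the lines in the question also carry a value after a comma. Here is a minimal sketch of the same idea keyed on the name alone, assuming the name sits between the first ':' and the first ','. The helpers `name_of` and `drop_repeated_names` are hypothetical names used only for illustration.

```python
def name_of(line):
    """Return the name field of a line shaped like 'line 1 : Messi , 1'."""
    # Assumption: the name sits between the first ':' and the first ','.
    _, _, rest = line.partition(':')
    return rest.split(',')[0].strip()


def drop_repeated_names(lines):
    """Keep only the lines whose name occurs exactly once, in input order."""
    first_line = {}   # name -> first line carrying that name
    repeated = set()  # names seen more than once
    for line in lines:
        name = name_of(line)
        if name in first_line:
            repeated.add(name)
        else:
            first_line[name] = line
    return [line for name, line in first_line.items() if name not in repeated]


sample = ["line 1 : Messi , 1\n", "line 2 : Messi , 2\n", "line 3 : CR7 , 2\n"]
print(drop_repeated_names(sample))  # ['line 3 : CR7 , 2\n']
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.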

Have you tried Counter? For example:

import collections

a = [1, 1, 2]

out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)

Output: [2]. Or, with a longer example:

import collections

a = [1, 1, 1, 2, 4, 4, 4, 5, 3]

out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)

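Output: [2, 5, 3]

Edit:

Since you don't start with a list, there are two approaches depending on the file size: the first works for files small enough to fit in memory (otherwise you may run into memory problems), the second is meant for larger files.

Read the file as a list and use the previous answer: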

import collections

lines = [line for line in open(infilename)]
out = [k for k, v in collections.Counter(lines).items() if v == 1]
with open(outfilename, 'w') as outfile:
    for o in out:
        outfile.write(o)

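The first statement reads the whole file into a list, which means a really large file gets loaded into memory. If you need to handle large files, you can instead use a kind of "blacklist":

Using a blacklist: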

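lines_seen = set()  # lines seen exactly once so far
blacklist = set()  # lines seen more than once
with open(infilename, "r") as infile:
    for line in infile:
        if line not in lines_seen and line not in blacklist:  # first occurrence
            lines_seen.add(line)
        else:  # repeated line: drop it and remember it as a duplicate
            lines_seen.discard(line)
            blacklist.add(line)
# A set is unordered, so the output order may differ from the input order
with open(outfilename, "w") as outfile:
    for l in lines_seen:
        outfile.write(l)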

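Here you add every line to a set and only write the set to the file at the end. The blacklist remembers every line that occurred more than once, so lines with multiple occurrences are never written, not even once. You cannot write the output as you go, because when you read a line you don't know yet whether the same line will occur a second time. If you have more information (for example, that repeated lines always appear consecutively), you can do it differently.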
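If repeats are known to be adjacent, a streaming variant becomes possible. The following is a sketch under that assumption, using `itertools.groupby` and, as a further assumption, taking the text before the first comma as the key. `first_part` and `singles_in_runs` are hypothetical helper names.

```python
from itertools import groupby


def first_part(line):
    # Assumption: the grouping key is the text before the first comma.
    return line.split(',')[0].strip()


def singles_in_runs(lines):
    """Yield lines whose first part is not repeated by an adjacent line."""
    for _, run in groupby(lines, key=first_part):
        first = next(run)            # every run holds at least one line
        if next(run, None) is None:  # no second line: the run has length 1
            yield first


sample = ["Messi , 1\n", "Messi , 2\n", "CR7 , 2\n"]
print(list(singles_in_runs(sample)))  # ['CR7 , 2\n']
```

Only one line is held back at a time, so this also works on a file object for large inputs. Repeats that are not adjacent are not detected.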

Edit 2

If you want to do this based on the first part:

firsts_seen = {}  # first part -> the only line seen so far with that first part
blacklist = set()  # first parts seen more than once
with open(infilename, "r") as infile:
    for line in infile:
        first = line.split(',')[0]
        if first not in firsts_seen and first not in blacklist:  # not a duplicate
            firsts_seen[first] = line
        else:
            # drop the earlier line that shares this first part, not the current one
            firsts_seen.pop(first, None)
            blacklist.add(first)
print(len(firsts_seen))
with open(outfilename, "w") as outfile:
    for l in firsts_seen.values():
        outfile.write(l)

Note: so far I have just kept adding code; there may be a better way.

For example, using a dict:

lines_dict = {}  # first part -> all lines that start with it
with open(infilename, 'r') as infile:
    for line in infile:
        first = line.split(',')[0]
        lines_dict.setdefault(first, []).append(line)
with open(outfilename, 'w') as outfile:
    for key, value in lines_dict.items():
        if len(value) == 1:  # keep only first parts that occur exactly once
            outfile.write(value[0])
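The dict above keeps every line of the file in memory. For large files, one possible variant reads the file twice and stores only a count per first part. This is a sketch that assumes the key is the text before the first comma; `first_part` and `write_unique_by_first_part` are hypothetical names.

```python
import collections


def first_part(line):
    # Assumption: the key is the text before the first comma.
    return line.split(',')[0].strip()


def write_unique_by_first_part(in_path, out_path):
    """Copy lines whose first part occurs exactly once in the input file."""
    # Pass 1: count how often each first part occurs.
    with open(in_path) as infile:
        counts = collections.Counter(first_part(line) for line in infile)
    # Pass 2: copy the lines whose first part was counted exactly once.
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            if counts[first_part(line)] == 1:
                outfile.write(line)
```

The output keeps the input order, and memory use grows with the number of distinct first parts instead of the number of lines.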
