在CSV文件中标记重复项

2 投票
5 回答
4567 浏览
提问于 2025-04-15 16:00

我遇到了一个问题,下面的例子说明了我的困扰:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

我想要在“PHONE”这一列中找出重复的号码,并用“REF”这一列来标记后面的重复项,标记的内容是指向第一个号码的“ID”,同时在“DISCARD”这一列中填上“是”。

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

那么,我该怎么做呢?我试过一些代码,但我的思路显然不对。

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

如果你能给我一些指导和帮助,我将非常感激。

5 个回答

0

听起来像是作业。因为这是一个CSV文件(所以改变记录的大小几乎是不可能的),最好的办法是把整个文件加载到内存中,然后在内存中进行处理,最后再写入一个新文件。你可以先创建一个字符串列表,里面存放文件的原始行。接着,创建一个映射,把电话号码作为键,ID作为值。在插入之前,你需要先检查这个号码是否已经存在。如果已经存在,就更新包含重复电话号码的那一行。如果这个号码不在映射中,就插入这个(电话号码,ID)对。

0
from operator import itemgetter
from itertools import groupby

import csv
verum = csv.reader(open('data.csv','rb'))

verum.sort(key=itemgetter(2,0))
def grouper( verum ):
    for key, grp in groupby(verum,itemgetter(2)):
        # key = phone number, grp = records with that number
        first = grp.next()
        # first item gets its id written into the 4th column
        yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first)) 
        for x in grp:
            # all others get the first items id as ref
            yield [x[0],x[1],x[2], first[0], "Yes"]

for line in sorted(grouper(verum), key=itemgetter(0)):
    print line

输出结果:

['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']

把数据写回去的部分就留给你自己去完成啦 ;-)

7

在这个程序运行的时候,你只需要记住一个东西,那就是把电话号码和它们对应的ID放在一个表里。

map = {}
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if map.has_key(phone):
                ref = map[phone]
                discard = "YES"
            else:
                map[phone] = id
            writer.writerow((id, name, phone, ref, discard))

撰写回答