在CSV文件中标记重复项
我遇到了一个问题,下面的例子说明了我的困扰:
"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,
我想要在“PHONE”这一列中找出重复的号码,并用“REF”这一列来标记后面的重复项,标记的内容是指向第一个号码的“ID”,同时在“DISCARD”这一列中填上“是”。
"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"
那么,我该怎么做呢?我试过一些代码,但我的思路显然不对。
import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")
reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
if row[2] == verum[i][2]:
verum[i][3] = row[0]
print verum
如果你能给我一些指导和帮助,我将非常感激。
5 个回答
0
听起来像是作业。因为这是一个CSV文件(所以改变记录的大小几乎是不可能的),最好的办法是把整个文件加载到内存中,然后在内存中进行处理,最后再写入一个新文件。你可以先创建一个字符串列表,里面存放文件的原始行。接着,创建一个映射,把电话号码作为键,ID作为值。在插入之前,你需要先检查这个号码是否已经存在。如果已经存在,就更新包含重复电话号码的那一行。如果这个号码不在映射中,就插入这个(电话号码,ID)对。
0
from operator import itemgetter
from itertools import groupby
import csv
verum = csv.reader(open('data.csv','rb'))
verum.sort(key=itemgetter(2,0))
def grouper( verum ):
for key, grp in groupby(verum,itemgetter(2)):
# key = phone number, grp = records with that number
first = grp.next()
# first item gets its id written into the 4th column
yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first))
for x in grp:
# all others get the first items id as ref
yield [x[0],x[1],x[2], first[0], "Yes"]
for line in sorted(grouper(verum), key=itemgetter(0)):
print line
输出结果:
['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']
把数据写回去的部分就留给你自己去完成啦 ;-)
7
在这个程序运行的时候,你只需要记住一个东西,那就是把电话号码和它们对应的ID放在一个表里。
map = {}
with open(r'c:\temp\input.csv', 'r') as fin:
reader = csv.reader(fin)
with open(r'c:\temp\output.csv', 'w') as fout:
writer = csv.writer(fout)
# omit this if the file has no header row
writer.writerow(next(reader))
for row in reader:
(id, name, phone, ref, discard) = row
if map.has_key(phone):
ref = map[phone]
discard = "YES"
else:
map[phone] = id
writer.writerow((id, name, phone, ref, discard))