Python - 排序CSV列并比较行
我有一个来自网络抓包的 .csv 文件。在这个文件里,有一些重复的消息需要我去识别。
发送者 A,接收者 G,43,信息...
发送者 H,接收者 R,43,信息...
发送者 A,接收者 G,27,信息...
发送者 N,接收者 Z,43,信息...
发送者 A,接收者 G,1367,信息...
发送者 R,接收者 P,43,信息...
发送者 A,接收者 G,43,信息...
发送者 H,接收者 R,111,信息...
重复的参数是标识符,但这并不一定意味着消息是重复的。在这种情况下,我还需要检查发送者和接收者。我想过先按第三列排序,然后从上到下循环比较这些列的值。虽然我已经成功地把重复数字的行隔离到一个文件里,但我在这里遇到了问题。
首先,我无法正确排序。
其次,我不知道怎么同时读取并比较一列(在我这个例子中是两列)和下面的值。我觉得可以用一个嵌套的 if 语句来实现,比如说 (如果 row[2] == row[2, 下面的行],那么检查 row[0] 和 row[1] 是否等于下面行的 row[0] 和 row[1]),但是我想了很久也没能写出一个能正常工作的代码来进行这个比较。
我的想法是打印或保存那些同时满足这三个条件(基本上就是前面三列)重复的情况。
发送者 A,接收者 G,43,信息...
发送者 A,接收者 G,43,信息...
也许我把事情想得太复杂了,可能有更简单或更快的方法。无论如何,我把我的代码发上来,希望有人能帮我。谢谢!
entries = []
duplicated = []
with open('file.csv', 'rt') as my_file:
for line in my_file:
columns = line.strip().split(',')
if columns[2] not in entries:
entries.append(columns[2])
else:
duplicated.append(columns[2])
#List with duplicated=null->no error
if duplicated==[]:
print "\nNo duplicated\n"
#Other case, there might be duplicates
else:
#Store error cases in New.csv
with open('New.csv', 'w') as out_file:
with open('file.csv', 'r') as my_file:
for line in my_file:
columns = line.strip().split(',')
if columns[2] in duplicate_entries:
out_file.write(line)
#TO SORT THE EXCEL FILE. CURRENTLY NOT WORKING PROPERLY
## data = csv.reader(open('Other.csv'),delimiter=',')
## sortedlist = sorted(data, key=operator.itemgetter(2), reverse=True)
## with open('Other.csv', 'w') as out_file:
## for item in sortedlist:
## out_file.write(item)
2 个回答
Martijn Pieters给你展示了很棒的“纯”Python解决方案,而我给你展示一些不同的东西——用pandas
模块的例子。
(我使用StringIO
来模拟读取文件)
data = """Sender A,Receiver G,43,Info...
Sender H,Receiver R,43,Info...
Sender A,Receiver G,27,Info...
Sender N,Receiver Z,43,Info...
Sender A,Receiver G,1367,Info...
Sender R,Receiver P,43,Info...
Sender A,Receiver G,43,Info...
Sender H,Receiver R,111,Info..."""
import pandas as pd
from StringIO import StringIO
# read all file
df = pd.read_csv(StringIO(data), index_col=None, header=None)
print df
# group rows by values in columns 0, 1, 2
for name, group in df.groupby([0,1,2]):
print '\n', '-'*40, '\n'
print 'name:', name
print 'len:', len(group)
print
print group
if len(group) > 1:
# append (`mode='a'`) data to `results.csv`
group.to_csv('results.csv', mode='a', header=False, index=False)
#group.to_csv('results.csv', mode='a', header=False)
我用pd.read_csv()
来读取整个文件。
(我假设文件里没有带标题的行,所以设置header=None
,
而且我没有使用任何列作为行索引,所以设置index_col=None
)
然后我根据第0、1、2列的值来分组行(并打印出来)。
如果某个组的元素超过一个,我就把它添加到文件'results.csv'
中。
我得到的文件是
Sender A,Receiver G,43,Info...
Sender A,Receiver G,43,Info...
或者如果我在to_csv()
中不使用index=False
,我还会得到行号(索引)。
0,Sender A,Receiver G,43,Info...
6,Sender A,Receiver G,43,Info...
这就是我在屏幕上打印的内容。
0 1 2 3
0 Sender A Receiver G 43 Info...
1 Sender H Receiver R 43 Info...
2 Sender A Receiver G 27 Info...
3 Sender N Receiver Z 43 Info...
4 Sender A Receiver G 1367 Info...
5 Sender R Receiver P 43 Info...
6 Sender A Receiver G 43 Info...
7 Sender H Receiver R 111 Info...
----------------------------------------
name: ('Sender A', 'Receiver G', 27)
len: 1
0 1 2 3
2 Sender A Receiver G 27 Info...
----------------------------------------
name: ('Sender A', 'Receiver G', 43)
len: 2
0 1 2 3
0 Sender A Receiver G 43 Info...
6 Sender A Receiver G 43 Info...
----------------------------------------
name: ('Sender A', 'Receiver G', 1367)
len: 1
0 1 2 3
4 Sender A Receiver G 1367 Info...
----------------------------------------
name: ('Sender H', 'Receiver R', 43)
len: 1
0 1 2 3
1 Sender H Receiver R 43 Info...
----------------------------------------
name: ('Sender H', 'Receiver R', 111)
len: 1
0 1 2 3
7 Sender H Receiver R 111 Info...
----------------------------------------
name: ('Sender N', 'Receiver Z', 43)
len: 1
0 1 2 3
3 Sender N Receiver Z 43 Info...
----------------------------------------
name: ('Sender R', 'Receiver P', 43)
len: 1
0 1 2 3
5 Sender R Receiver P 43 Info...
其实没必要对文件进行排序,但你的排序可能遇到了字符串和数字排序的问题。字符串是按照字典顺序排序的,这意味着 '10'
会排在 '2'
前面,因为在字符集中 1
比 2
早,而 0
在这里并不影响排序。
你可以通过把重复的内容存储在一个字典里来追踪它们;这样你就可以查找之前见过的匹配项。使用 collections.defaultdict()
是最简单的方式:
import csv
from collections import defaultdict
seen = defaultdict(list)
with open('file.csv', 'rb') as my_file:
reader = csv.reader(my_file)
for row in reader:
key = (row[0], row[1], row[2]) # sender, receiver, id
seen[key].append(row)
with open('new.csv', 'wb') as outf:
writer = csv.writer(outf)
for collected in seen.values():
if len(collected) > 1:
writer.writerows(collected)
这个版本会根据 (发送者, 接收者, id) 这三个元素对输入的 CSV 文件进行分组,然后再把所有的行写出来,但只有在每组三个元素有多于一行的情况下才会写出。
你也可以保持一个计数;在字典中统计每个三元组出现的次数;使用 collections.Counter()
也能很方便地做到这一点,并且可以在之后按频率进行排序:
import csv
from collections import Counter
with open('file.csv', 'rb') as my_file:
reader = csv.reader(my_file)
counts = Counter((r[0], r[1], r[2]) for r in reader)
with open('new.csv', 'wb') as outf:
writer = csv.writer(outf)
for (sender, receiver, id), count in counts.most_common():
writer.writerow([sender, receiver, id, count])
用你的示例数据进行演示:
>>> import csv
>>> from collections import defaultdict
>>> sample = '''\
... Sender A,Receiver G,43,Info...
... Sender H,Receiver R,43,Info...
... Sender A,Receiver G,27,Info...
... Sender N,Receiver Z,43,Info...
... Sender A,Receiver G,1367,Info...
... Sender R,Receiver P,43,Info...
... Sender A,Receiver G,43,Info...
... Sender H,Receiver R,111,Info...
... '''.splitlines(True)
>>> seen = defaultdict(list)
>>> reader = csv.reader(sample)
>>> for row in reader:
... key = (row[0], row[1], row[2]) # sender, receiver, id
... seen[key].append(row)
...
>>> import sys
>>> writer = csv.writer(sys.stdout)
>>> for collected in seen.values():
... if len(collected) > 1:
... writer.writerows(collected)
...
Sender A,Receiver G,43,Info...
Sender A,Receiver G,43,Info...
或者使用 Counter
的方法:
>>> from collections import Counter
>>> reader = csv.reader(sample)
>>> counts = Counter((r[0], r[1], r[2]) for r in reader)
>>> writer = csv.writer(sys.stdout)
>>> for (sender, receiver, id), count in counts.most_common():
... writer.writerow([sender, receiver, id, count])
...
Sender A,Receiver G,43,2
Sender A,Receiver G,1367,1
Sender A,Receiver G,27,1
Sender N,Receiver Z,43,1
Sender H,Receiver R,111,1
Sender H,Receiver R,43,1
Sender R,Receiver P,43,1