比较两个CSV文件并搜索相似项

2024-04-29 13:20:33 发布

您现在位置:Python中文网/ 问答频道 /正文

所以我有两个CSV文件,我试图比较并得到类似项目的结果。 第一个文件hosts.csv如下所示:

Path    Filename    Size    Signature
C:\     a.txt       14kb    012345
D:\     b.txt       99kb    678910
C:\     c.txt       44kb    111213

第二个文件masterlist.csv如下所示:

Filename    Signature
b.txt       678910
x.txt       111213
b.txt       777777
c.txt       999999

如您所见,行不匹配,masterlist.csv始终大于hosts.csv文件。我只想搜索签名部分。我知道这看起来像是:

hosts[3] == masterlist[1]

我正在寻找一个解决方案,它将提供如下内容(基本上是hosts.csv文件和一个新的结果列):

Path    Filename    Size    Signature    RESULTS
C:\     a.txt       14kb    012345       NOT FOUND in masterlist
D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

我已经搜索了这些帖子,发现了一些类似于here的东西,但是我不太明白,因为我还在学习python。

使用Python2.6编辑


Tags: 文件csvpath项目intxtsize解决方案
3条回答

编辑:当我的解决方案正常工作时,请查看下面Martijn的答案以获得更有效的解决方案。

您可以找到python CSV模块here的文档。

你要找的是这样的东西:

import csv

f1 = file('hosts.csv', 'r')
f2 = file('masterlist.csv', 'r')
f3 = file('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

masterlist = list(c2)

for hosts_row in c1:
    row = 1
    found = False
    for master_row in masterlist:
        results_row = hosts_row
        if hosts_row[3] == master_row[1]:
            results_row.append('FOUND in master list (row ' + str(row) + ')')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('NOT FOUND in master list')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

Python的CSV和collections模块,特别是OrderedDict,在这里非常有用。你想使用OrderedDict来维护密钥的顺序,等等。你不必这么做,但它很有用!

import csv
from collections import OrderedDict


signature_row_map = OrderedDict()


with open('hosts.csv') as file_object:
    for line in csv.DictReader(file_object, delimiter='\t'):
        signature_row_map[line['Signature']] = {'line': line, 'found_at': None}


with open('masterlist.csv') as file_object:
    for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
        if line['Signature'] in signature_row_map:
            signature_row_map[line['Signature']]['found_at'] = i


with open('newhosts.csv', 'w') as file_object:
    fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
    writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
    writer.writer.writerow(fieldnames)
    for signature_info in signature_row_map.itervalues():
        result = '{0} FOUND in masterlist {1}'
        # explicit check for sentinel
        if signature_info['found_at'] is not None:
            result = result.format('', '(row %s)' % signature_info['found_at'])
        else:
            result = result.format('NOT', '')
        payload = signature_info['line']
        payload['RESULTS'] = result

        writer.writerow(payload)

下面是使用测试CSV文件的输出:

Path    Filename        Size    Signature       RESULTS
C:\     a.txt   14kb    012345  NOT FOUND in masterlist 
D:\     b.txt   99kb    678910   FOUND in masterlist (row 1)
C:\     c.txt   44kb    111213   FOUND in masterlist (row 2)

请原谅偏差,它们是分开的:)

srgerg的答案是非常低效的,因为它在二次时间内运行;下面是一个线性时间解决方案,它使用了Python 2.6兼容的语法:

import csv

with open('masterlist.csv', 'rb') as master:
    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

with open('hosts.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:    
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

这将生成一个字典,首先将签名从masterlist.csv映射到行号。在字典中查找需要恒定的时间,使第二个循环在hosts.csv行上,独立于masterlist.csv中的行数。更不用说更简单的代码了。

对于那些使用Python 3的用户,上面的调用只需要将open()调整为以文本模式打开(从文件模式中删除b),并且您希望添加new line='',这样CSV读取器就可以控制行分隔符。您可能希望声明要显式使用的编码,而不是依赖于系统默认值(使用encoding=...)。可以通过字典理解({r[1]: i for i, r in enumerate(csv.reader(master))})来构建master_indices映射。

相关问题 更多 >