在Python中查找两个CSV文件的共同区域
我有两个CSV文件,每个文件有10列,第一列叫做“主键”。
我需要用Python找出这两个CSV文件之间的共同区域。比如,我应该能发现CSV1中的第27到45行和CSV2中的第125到145行是相等的,依此类推。
我只比较主键(第一列),其他的数据不考虑。我需要把这些共同区域提取到两个单独的CSV文件中(一个对应CSV1,一个对应CSV2)。
我已经把两个CSV文件的行解析并存储在两个“列表的列表”中,分别是 lstCAN_LOG_TABLE
和 lstSHADOW_LOG_TABLE
,所以问题就变成了比较这两个列表的列表。
我现在的假设是,如果有10个连续的匹配(MAX_COMMON_THRESHOLD
),那么我就找到了一个共同区域的开始。我不能只记录单行(比较为真),因为可能会有相等的区域(根据主键),而这些区域是我需要识别的。
for index in range(len(lstCAN_LOG_TABLE)):
for l_index in range(len(lstSHADOW_LOG_TABLE)):
if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
index_can_log = index #Position where CAN Log is to be compared
index_shadow_log = l_index #Position from where CAN Shadow Log is to be considered
start = index_shadow_log
if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
end = index_shadow_log + MAX_COMMON_THRESHOLD
else:
end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
can_index = index
bPreScreened = 1
for num in range(start,end):
if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
if((can_index + 1) < (input_file_one_row_count-1)):
can_index = can_index + 1
else:
break
else:
bPreScreened = 0
print("No Match")
break
#we might have found start of common region
if(bPreScreened == 1):
print("Start={0} End={1} can_index={2}".format(start,end,can_index))
for number in range(start,end):
if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
writer_one.writerow(lstCAN_LOG_TABLE[index][0])
if((index + 1) < (input_file_one_row_count-1)):
index = index + 1
else:
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
return
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
我得到的输出很奇怪。即使第一个CSV文件只有1880行,但在第一个CSV的共同区域CSV中,我却得到了更多的条目。我没有得到想要的输出。
从这里开始编辑
CSV1:
216 0.000238225 F4 41 C0 FB 28 0 0 0 MS CAN
109 0.0002256 15 8B 31 0 8 43 58 0 HS CAN
216 0.000238025 FB 47 C6 1 28 0 0 0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
CSV2:
216 0.0002361 0F 5C DB 14 28 0 0 0 MS CAN
216 0.000236225 16 63 E2 1B 28 0 0 0 MS CAN
109 0.0001412 16 A3 31 0 8 63 58 0 HS CAN
216 0.000234075 1C 6A E9 22 28 0 0 0 MS CAN
40A 0.000259925 C1 1 46 54 30 44 47 36 HS CAN
4A 0.000565975 2 0 0 0 0 0 0 C0 MS CAN
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
期望的输出 CSV1:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
期望的输出 CSV2:
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
观察到的输出 CSV1
340 0.000240175 0A 18 0 C2 0 0 6F FF MS CAN
216 0.000240225 24 70 EF 28 28 0 0 0 MS CAN
216 0.000236225 2B 77 F7 2F 28 0 0 0 MS CAN
216 0.0002278 31 7D FD 35 28 0 0 0 MS CAN
还有成千上万的冗余行数据
编辑 - 根据建议解决(将FOR改为WHILE):
学习: 在Python中,FOR循环的索引在运行时不能更改
dump_file=open("MATCH_PATTERN.txt",'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))
index = 0
while(index < (input_file_one_row_count - 1)):
l_index = 0
while(l_index < (input_file_two_row_count - 1)):
if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]): #Consider for comparison only CAN IDs
index_can_log = index #Position where CAN Log is to be compared
index_shadow_log = l_index #Position from where CAN Shadow Log is to be considered
start = index_shadow_log
can_index = index
if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
end = index_shadow_log + MAX_COMMON_THRESHOLD
else:
end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
bPreScreened = 1
for num in range(start,end):
if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
if((can_index + 1) < (input_file_one_row_count-1)):
can_index = can_index + 1
else:
break
else:
bPreScreened = 0
break
#we might have found start of common region
if(bPreScreened == 1):
print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
for number in range(start,end):
if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):
writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
writer_one.writerow(lstCAN_LOG_TABLE[index][0])
if((index + 1) < (input_file_one_row_count-1)):
index = index + 1
if((l_index + 1) < (input_file_two_row_count-1)):
l_index = l_index + 1
else:
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
return
else:
l_index = l_index + 1
else:
l_index = l_index + 1
index = index + 1
dump_file.close()
print("\nCommon Region in Two CSVs identifed and recorded\n")
1 个回答
index
是你在 for
循环中的一个计数器。如果你在循环内部改变了它的值,每次循环结束后它都会被重新赋值。
比如说,当 index = 5
时,如果你执行了三次 index += 1
,那么现在 index = 8
。但是当这一轮循环结束后,代码回到 for
循环时,index
会被重新赋值为 index x = 6
。
试试下面的例子:
for index in range(0,5):
print 'iterator:', index
index = index + 2
print 'index:', index
输出结果将会是:
iterator: 0
index: 2
iterator: 1
index: 3
iterator: 2
index: 4
iterator: 3
index: 5
iterator: 4
index: 6
要解决这个问题,你可以考虑把 for
循环改成 while
循环。
编辑:如果我没理解错的话,你是想在两个文件中找到相同的列并存储它们。如果是这样,其实你可以用下面的代码轻松完成这个工作:
import csv # import csv module to read csv files
file1 = 'csv1.csv' # input file 1
file2 = 'csv2.csv' # input file 2
outfile = 'csv3.csv' # only have one output file since two output files will be the same
read1 = csv.reader(open(file1, 'r')) # read input file 1
write = csv.writer(open(outfile, 'w')) # write to output file
# for each row in input file 1, compare it with each row in input file 2
# if they are the same, write that row into output file
for row1 in read1:
read2 = csv.reader(open(file2, 'r'))
for row2 in read2:
if row1 == row2:
write.writerow(row1)
read1.close()
write.close()