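Removing/listing duplicates in Python by comparing multiple lists
I know the question of how to remove duplicates from a list has been asked before. My problem is that I want to compare several lists at the same time.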
lst = [item1, item2, item3, item4, item5]
a = [1,2,1,5,1]
b = [2,0,2,5,2]
c = [0,1,0,1,5]
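Suppose these are the lists I want to compare, and I want to compare them the way the zip function would. I want to check that indices 0, 2 and 4 are duplicates in list a, and at the same time see whether those same indices are duplicates in the other lists. For example, in list b indices 0, 2 and 4 are duplicates as well, but in list c only indices 0 and 2 are. So I only want to list indices 0 and 2, ending up with the result list [item1, item3].
How would I modify this function to do that?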
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]
2 Answers
0
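You want to find which positions hold duplicated values across several lists, rather than the duplicated values themselves. That means that besides recording which items are duplicated in a given seq, we also need to record where those duplicates occur. That is easy to add to the existing approach: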
from collections import defaultdict

def list_duplicates(seq):
    seen = set()
    seen_twice = set()
    seen_indices = defaultdict(list)  # To keep track of seen indices
    for index, x in enumerate(seq):  # Can't use a comprehension now, too much logic in there.
        seen_indices[x].append(index)
        if x in seen:
            seen_twice.add(x)
        else:
            seen.add(x)
    print(seen_indices)
    return list(seen_twice)

if __name__ == "__main__":
    a = [1,2,3,2,1,5,6,5,5,5]
    duped_items = list_duplicates(a)
    print(duped_items)
This outputs:
defaultdict(<type 'list'>, {1: [0, 4], 2: [1, 3], 3: [2], 5: [5, 7, 8, 9], 6: [6]})
[1, 2, 5]
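So now we track not only the duplicated values themselves, but also every position at which they occur.
The next step is to apply this across multiple lists. We can take advantage of the fact that going through one list rules out the positions we know are not duplicates, so in each later list we only need to visit the positions still known to be duplicated. This takes a small change to the logic so that we iterate over the "possibly duplicated positions" instead of the whole list: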
def list_duplicates2(*seqs):
    val_range = range(0, len(seqs[0]))  # At first, all indices could be duplicates.
    for seq in seqs:
        # Set up is the same as before.
        seen_items = set()
        seen_twice = set()
        seen_indices = defaultdict(list)
        for index in val_range:  # Iterate over the possibly duplicated indices, not the whole sequence
            val = seq[index]
            seen_indices[val].append(index)
            if val in seen_items:
                seen_twice.add(val)
            else:
                seen_items.add(val)
        # Now that we've gone over the current val_range, we can create a
        # new val_range for the next iteration by only including the indices
        # in seq which contained values that we found at least twice in the
        # current val_range.
        val_range = [duped_index for seen_val in seen_twice for duped_index in seen_indices[seen_val]]
        print("new val_range is %s" % val_range)
    return val_range

if __name__ == "__main__":
    a = [1,2,1,5,1]
    b = [2,0,2,5,2]
    c = [0,1,0,1,5]
    duped_indices = list_duplicates2(a, b, c)
    print("duped_indices is %s" % duped_indices)
This outputs:
new val_range is [0, 2, 4]
new val_range is [0, 2, 4]
new val_range is [0, 2]
duped_indices is [0, 2]
Which is exactly the result you want.
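The question asks for the matching entries of lst rather than the indices, so one more lookup is needed. Here is a minimal sketch of that step; the strings are placeholders standing in for item1 to item5 (an assumption, since the question does not say what they are), and the indices are sorted because the order coming out of a set is not guaranteed.

```python
# Placeholder values standing in for item1..item5 from the question (assumed).
lst = ["item1", "item2", "item3", "item4", "item5"]

# Indices returned by list_duplicates2(a, b, c) for the example lists.
duped_indices = [0, 2]

# Pick the entries of lst at the duplicated positions, in index order.
result = [lst[i] for i in sorted(duped_indices)]
print(result)  # ['item1', 'item3']
```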
0
Look for duplicates in this list:
l = [[a[i],b[i],c[i]] for i in range(len(a))]
With your example, it produces this list:
[[1, 2, 0], [2, 0, 1], [1, 2, 0], [5, 5, 1], [1, 2, 5]]
Then:
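# Lists are unhashable, so convert each row to a tuple before looking for duplicates,
# and compute the duplicates once rather than once per element.
dupes = list_duplicates([tuple(x) for x in l])
result = [lst[i] for (i,x) in enumerate(l) if tuple(x) in dupes]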
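The same idea can probably be written in one pass by grouping the zipped rows directly, which avoids the intermediate list and the repeated membership tests. This is only a sketch: duplicate_positions is a hypothetical helper name, and the strings in lst are placeholders for item1 to item5.

```python
from collections import defaultdict

def duplicate_positions(*seqs):
    """Return the sorted indices whose row of values (one value per list) occurs more than once."""
    positions = defaultdict(list)
    # zip(*seqs) yields one tuple per index, e.g. (a[0], b[0], c[0]).
    for index, row in enumerate(zip(*seqs)):
        positions[row].append(index)
    # Keep only the indices of rows that were seen at least twice.
    return sorted(i for indices in positions.values() if len(indices) > 1 for i in indices)

a = [1, 2, 1, 5, 1]
b = [2, 0, 2, 5, 2]
c = [0, 1, 0, 1, 5]
lst = ["item1", "item2", "item3", "item4", "item5"]  # placeholder items (assumed)

indices = duplicate_positions(a, b, c)
print(indices)                    # [0, 2]
print([lst[i] for i in indices])  # ['item1', 'item3']
```

Tuples are hashable, so they can be used as dictionary keys, and each row is visited only once.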