如何从元组列表中去重并保持原有顺序
我想去掉重复的元组,但又想保持它们出现的顺序。我查了一些类似的问题。这个问题 在numpy.array中找到唯一行 看起来很有希望,但不知怎么的对我没用。
我可以像这个回答中那样使用pandas (https://stackoverflow.com/a/14089586/566035),但我更希望不使用pandas,这样用py2exe生成的可执行文件会小一些。
import numpy as np
data = [('a','z'), ('a','z'), ('a','z'), ('1','z'), ('e','z'), ('c','z')]
#What I want is:
array([['a', 'z'],
['1', 'z'],
['e', 'z'],
['c', 'z']],
dtype='|S1')
#What I have tried:
# (1) numpy.unique, order not preserved
np.unique(data)
array([['a', 'z'],
['c', 'z'],
['1', 'z'],
['e', 'z']],
dtype='|S1')
# (2) python set, order not preserved
set(data)
set([('1', 'z'), ('a', 'z'), ('c', 'z'), ('e', 'z')])
# (3) answer here : https://stackoverflow.com/a/16973510/566035, order not preserved
a = np.array(data)
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)
a[idx]
array([['1', 'z'],
['a', 'z'],
['c', 'z'],
['e', 'z']],
dtype='|S1')
3 个回答
0
根据我的测试,使用列表推导式配合集合来检查唯一性,可以把运行时间提高4倍(也就是把复杂度从O(n^2)降低到O(n))。
import functools
import time
import string
import random
# initializing list
data = [
(random.choice(string.ascii_lowercase),
random.choice(string.ascii_lowercase))
for _ in range(10_000)
]
# using reduce to isolate uniques while keeping order
start_time = time.time()
control_set = set()
result1 = functools.reduce(lambda a, b: control_set.add(b) or a+[b] if b not in control_set else a, data, [])
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.002000570297241211 seconds ---"
# creating a set and ordering them by original index
start_time = time.time()
control_set = set()
result2 = sorted(set(data), key=data.index)
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.00800013542175293 seconds ---"
# list comprehension with set-membership
start_time = time.time()
control_set = set()
result3 = [
data_element
for data_element in data
if data_element not in control_set
and (control_set.add(data_element) or True)
]
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.0010018348693847656 seconds ---"
def get_unique(rec_list):
if len(rec_list) == 1:
return rec_list
l = get_unique(rec_list[:len(rec_list)//2])
r = get_unique(rec_list[len(rec_list)//2:])
set_l = set(l)
set_r = set(r)
set_r -= (set_l.intersection(set_r))
return sorted(set_l, key=l.index) + sorted(set_r, key=r.index)
start_time = time.time()
result4 = get_unique(data)
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.11902070045471191 seconds ---"
assert result1 == result2 == result3 == result4, "one of them failed"
1
哎呀!我自己找到了答案...
seen = set()
np.array([x for x in data if x not in seen and not seen.add(x)])
# output
array([['a', 'z'],
['1', 'z'],
['e', 'z'],
['c', 'z']],
dtype='|S1')
2
这个方法在效率上不是特别好,但代码很简单易懂,适合处理小一点的列表:
sorted(set(data), key=data.index)