Compare two arrays for equal elements and remove those entries from both arrays

Posted 2024-03-28 09:11:33

Suppose I have two numpy arrays with the following elements:

arr1 = [[1, 2], [3, 5], [3, 4]]
arr2 = [[2, 3], [3, 4], [6, 6]]

I want to append (or stack) arr2 onto arr1 so that the resulting array contains only the elements that do not exist in both arrays:

[[1, 2], [3, 5], [2, 3], [6, 6]]

As you can see, [3, 4] is not in the expected result array. What is the best numpy and pythonic implementation of this?


3 Answers

How about:

[l for l in arr1+arr2 if (arr1+arr2).count(l)==1]

Output:

[[1, 2], [3, 5], [2, 3], [6, 6]]

Or, if you want it to be a little more efficient, build the combined list only once:

c=arr1+arr2
[l for l in c if c.count(l)==1]
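Both variants above call `count` once per row, so the work grows quadratically with the number of rows. As a minimal sketch (not part of the original answer), the counting can be done in a single pass with `collections.Counter`, assuming the rows are plain Python lists of hashable values. The function name `rows_occurring_once` is hypothetical and chosen only for illustration.

```python
from collections import Counter

arr1 = [[1, 2], [3, 5], [3, 4]]
arr2 = [[2, 3], [3, 4], [6, 6]]

def rows_occurring_once(a, b):
    """Return the rows of a + b that occur exactly once, in original order."""
    combined = a + b
    # Lists are unhashable, so count tuple versions of the rows instead.
    counts = Counter(map(tuple, combined))
    return [row for row in combined if counts[tuple(row)] == 1]

result = rows_occurring_once(arr1, arr2)
print(result)  # [[1, 2], [3, 5], [2, 3], [6, 6]]
```

Like the `count` version, this also drops rows that are duplicated within a single input list.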

If only items at the same index can be equal, you can use a simple NumPy approach like this:

In [10]: mask = ~(arr1 == arr2).all(1)

In [11]: np.vstack((arr1[mask], arr2[mask]))
Out[11]: 
array([[1, 2],
       [3, 5],
       [2, 3],
       [6, 6]])

Otherwise, you can find the intersection as described in Jaime's answer and then exclude those rows from the combined array.
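When equal rows may sit at different indices, one way to sketch the "find the intersection, then exclude it" idea is a pairwise row comparison through broadcasting. This is an illustrative assumption, not the code from Jaime's answer, and it builds an intermediate array of size len(arr1) x len(arr2) x ncol, so it only suits small inputs.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 5], [3, 4]])
arr2 = np.array([[2, 3], [3, 4], [6, 6]])

# match[i, j] is True when row i of arr1 equals row j of arr2.
match = (arr1[:, None, :] == arr2[None, :, :]).all(axis=-1)

# Keep the rows of each array that have no equal row in the other array.
keep1 = ~match.any(axis=1)
keep2 = ~match.any(axis=0)

result = np.vstack((arr1[keep1], arr2[keep2]))
print(result)
# [[1 2]
#  [3 5]
#  [2 3]
#  [6 6]]
```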

Or use the following approach:

^{pr2}$

If you are dealing with Python lists, and especially short ones, you can simply use a list comprehension like the following:

In [204]: [i for i in arr1 + arr2 if not (i in arr1 and i in arr2)]
Out[204]: [[1, 2], [3, 5], [2, 3], [6, 6]]

Note that for short lists this is much faster than converting the rows to tuples and using set, but for large arrays it is still better to use the NumPy approach:

^{4}$
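The snippet that originally followed here is not available. As a stand-in sketch (an assumption, not the answerer's code), a NumPy route for larger arrays can stack both arrays and keep the rows that `np.unique` reports exactly once. It assumes neither array contains duplicate rows of its own, and the result comes back in sorted row order rather than in input order.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 5], [3, 4]])
arr2 = np.array([[2, 3], [3, 4], [6, 6]])

combined = np.vstack((arr1, arr2))

# axis=0 treats every row as one item; counts holds how often each row occurs.
uniq, counts = np.unique(combined, axis=0, return_counts=True)

result = uniq[counts == 1]
print(result)
# [[1 2]
#  [2 3]
#  [3 5]
#  [6 6]]
```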

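I'm a bit late to the party, but here is an approach that leverages numpy's speed at only moderate algorithmic cost: O(n log n). It turns out that for many sizes of input arrays the run time is dominated by the cost of type conversion. See the benchmarks below: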

from timeit import timeit
import numpy as np

def make_inputs(N, ncol):
    global arr1, arr2, list1, list2, lot1, lot2
    # create making sure there are no duplicates *within* arr1 or arr2
    all_ = np.array(list(set(map(tuple, np.random.randint(0, 2 * N, (N + N//2, ncol))))))
    # create input of various data types
    arr1 = all_[np.random.choice(len(all_), N, False)]
    arr2 = all_[np.random.choice(len(all_), N, False)]
    list1 = arr1.tolist()
    list2 = arr2.tolist()
    lot1 = list(map(tuple, list1))
    lot2 = list(map(tuple, list2))

def np_unique_preserve_order(a, b):
    c = np.r_[a, b]
    cr = c.view(np.dtype("|S" + str(c.shape[-1] * c.dtype.itemsize)))
    uniq, inv, count = np.unique(cr.ravel(), return_inverse=True,
                                 return_counts=True)
    return c[(count==1)[inv], :]

def np_unique(a, b):
    c = np.r_[a, b]
    cr = c.view(np.dtype("|S" + str(c.shape[-1] * c.dtype.itemsize)))
    uniq, count = np.unique(cr.ravel(), return_counts=True)
    return uniq[count==1, None].view(c.dtype)

def np_sort(a, b):
    c = np.r_[a, b]
    cr = np.sort(c.view(np.dtype("|S" + str(c.shape[-1] * c.dtype.itemsize))).ravel())
    m = np.empty(cr.shape, bool)
    m[0] = True
    m[1:] = cr[:-1] != cr[1:]
    m[:-1] &= m[1:]
    return cr[m, None].view(c.dtype)

# check
make_inputs(1000, 2)
assert set(map(tuple, lot1)).symmetric_difference(set(map(tuple, lot2))) == set(map(tuple, np_sort(arr1, arr2)))
assert set(map(tuple, lot1)).symmetric_difference(set(map(tuple, lot2))) == set(map(tuple, np_unique(arr1, arr2)))
assert set(map(tuple, lot1)).symmetric_difference(set(map(tuple, lot2))) == set(map(tuple, np_unique_preserve_order(arr1, arr2)))


for N, ncol in ((10, 2), (10000, 2), (100000, 20)):
    make_inputs(N, ncol)
    results = []
    for inputs in 'lot', 'list', 'arr':
        res = []
        if inputs == 'lot':
            res.append('{:11.5f} ms'.format(timeit(f'list(set({inputs}1).symmetric_difference(set({inputs}2)))',
                 f'from __main__ import {inputs}1, {inputs}2', number=10) * 100))
        else:
            res.append('{:11.5f} ms'.format(timeit(f'list(set(map(tuple, {inputs}1)).symmetric_difference(set(map(tuple, {inputs}2))))',
                 f'from __main__ import {inputs}1, {inputs}2', number=10) * 100))

        res.append('{:11.5f} ms'.format(timeit(f'np_sort({inputs}1, {inputs}2)', f'from __main__ import {inputs}1, {inputs}2, np_sort',
                 number=10) * 100))
        res.append('{:11.5f} ms'.format(timeit(f'np_unique({inputs}1, {inputs}2)', f'from __main__ import {inputs}1, {inputs}2, np_unique',
                 number=10) * 100))
        res.append('{:11.5f} ms'.format(timeit(f'np_unique_preserve_order({inputs}1, {inputs}2)', f'from __main__ import {inputs}1, {inputs}2, np_unique_preserve_order',
                 number=10) * 100))
        results.append(res)
    results = zip(*results)
    appmin = lambda l: l + (min(l),)
    print(f'\nno rows {N}, no columns {ncol}')
    print('input type                           lot           list          array           best')
    print(f'symm_diff                ', *appmin(next(results)))
    print(f'np_sort                  ', *appmin(next(results)))
    print(f'np_unique                ', *appmin(next(results)))
    print(f'np_unique_preserve_order ', *appmin(next(results)))

Output:

^{pr2}$

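For very small arrays symm_diff is the fastest, but for larger arrays np_sort has a small edge when every method is allowed to use the data type it is most comfortable with.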
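The core of np_sort is the boolean mask built after sorting: an element occurs exactly once when it differs from both of its sorted neighbours. The following minimal sketch shows that step on a plain 1-D integer array with made-up values. It builds two explicit masks instead of updating one mask in place, which is a readability choice for this illustration rather than what the benchmark code does.

```python
import numpy as np

keys = np.array([5, 2, 7, 5, 1, 2, 5, 3])
s = np.sort(keys)  # [1 2 2 3 5 5 5 7]

# True where an element differs from its left / right neighbour;
# the first and last elements have only one neighbour to compare with.
differs_from_prev = np.r_[True, s[1:] != s[:-1]]
differs_from_next = np.r_[s[:-1] != s[1:], True]

# An element that differs from both neighbours occurs exactly once.
once = s[differs_from_prev & differs_from_next]
print(once)  # [1 3 7]
```

In np_sort the same test runs on rows that have been viewed as single fixed-width byte strings, so each row is compared as one item.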
