Fastest inverse of index-to-label in a numpy array: a dictionary (hash) from label to indexes

import numpy as np

def get_key_to_indexes_dic(labels):
    """
    Builds a dictionary whose keys are the labels and whose items are
    all the indexes that have that particular key
    """
    # Get the unique labels and initialize the dictionary
    label_set = set(labels)
    key_to_indexes = {}
    for label in label_set:
        # np.where returns a tuple of index arrays, one per dimension
        key_to_indexes[label] = np.where(labels == label)
    return key_to_indexes

3 Answers

User

Answer 1 · edited 2024-04-24 19:29:01

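The numpy_indexed package (disclaimer: I am its author) can be used to solve this kind of problem in a fully vectorized manner, with O(n log n) worst-case time complexity: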

import numpy as np
import numpy_indexed as npi

# One index per element of labels, grouped by the label it belongs to
indices = np.arange(len(labels))
unique_labels, indices_per_label = npi.group_by(labels, indices)

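Note that for many common applications of this kind of functionality, such as computing a sum or mean over group labels, it is more efficient not to compute the split list of indices at all, but to use the functions that npi provides for this; i.e., npi.group_by(labels).mean(some_corresponding_array), rather than looping over indices_per_label and taking the mean over those indices.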
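To illustrate the point about skipping the split index lists, here is a minimal plain-NumPy sketch of a per-label mean. It uses np.bincount rather than numpy_indexed, so it is a stand-in for the idea and not that package's implementation. The arrays `labels` and `values` are made-up example data, and the sketch assumes the labels are non-negative integers that each occur at least once.

```python
import numpy as np

# Hypothetical example data: non-negative integer labels, one value per label.
labels = np.array([0, 2, 1, 0, 2, 2])
values = np.array([1.0, 4.0, 3.0, 5.0, 6.0, 8.0])

# Sum of the values per label and number of occurrences per label,
# both computed without a Python-level loop.
sums = np.bincount(labels, weights=values)
counts = np.bincount(labels)

# Per-label mean; assumes every label in 0..max(labels) occurs at least once,
# otherwise counts contains zeros and the division is undefined there.
means = sums / counts
print(means)
```

No per-label index array is ever materialized here, which is where the saving comes from.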

User

Answer 2 · edited 2024-04-24 19:29:01

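Assuming the labels are consecutive integers [0, m] and the array holds n labels, building set(labels) is O(n) and the np.where calls in the loop are O(m*n) in total. However, the overall complexity is written as O(m*n) rather than O(m*n + n); see "Big O notation" on Wikipedia.

There are two things that can improve performance: 1) using a more efficient algorithm (lower complexity) and 2) replacing Python loops with fast array operations.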
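As a rough illustration of both points combined, the sketch below replaces the per-label np.where scan with a single stable sort, which is O(n log n) overall. The function name `key_to_indexes_sorted` is hypothetical and not taken from any of the answers. A short Python-level `zip` over the unique labels remains, but it no longer rescans the whole array for each label.

```python
import numpy as np

def key_to_indexes_sorted(labels):
    """Hypothetical helper: map each label to the positions where it occurs."""
    labels = np.asarray(labels).ravel()
    # A stable sort keeps the positions of each label in ascending order.
    order = np.argsort(labels, kind="stable")
    # Unique labels (sorted) and the start of each label's run in sorted order.
    unique_labels, starts = np.unique(labels[order], return_index=True)
    # Cut the sorted positions at the run boundaries: one chunk per label.
    groups = np.split(order, starts[1:])
    return dict(zip(unique_labels.tolist(), groups))

result = key_to_indexes_sorted(np.array([3, 1, 3, 2, 1]))
for label, positions in result.items():
    print(label, positions)
```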

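The other answers posted so far do exactly this, with very reasonable code. However, an optimal solution would be both fully vectorized and of O(n) complexity. This can be achieved by using a certain lower-level function from Scipy: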

import numpy as np

def sparse_hack(labels):
    # Private SciPy helper; its location may change between SciPy versions
    from scipy.sparse._sparsetools import coo_tocsr

    labels = labels.ravel()
    n = len(labels)
    nlabels = np.max(labels) + 1

    indices = np.arange(n)
    sorted_indices = np.empty(n, int)
    offsets = np.zeros(nlabels+1, int)
    dummy = np.zeros(n, int)
    # Treat labels as row indices and indices as the data of an (nlabels x 1) matrix
    coo_tocsr(nlabels, 1, n, labels, dummy, indices,
                             offsets, dummy, sorted_indices)

    return sorted_indices, offsets

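The source for coo_tocsr can be found here. The way I use it, it essentially performs an indirect counting sort. To be honest, this is a rather obscure approach and I would advise you to use one of the approaches in the other answers.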
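To make the returned pair easier to read, here is a small sketch that builds what should be the same (sorted_indices, offsets) layout with public NumPy calls only, and then reads out the positions of one label. The name `counting_layout` is hypothetical. It is a stand-in, not the Scipy routine: it relies on a stable argsort, so it runs in O(n log n) rather than O(n), and it assumes non-negative integer labels.

```python
import numpy as np

def counting_layout(labels):
    """Hypothetical stand-in producing a (sorted_indices, offsets) pair."""
    labels = np.asarray(labels).ravel()
    # offsets[k] is where the block for label k starts, offsets[k + 1] where it ends.
    offsets = np.concatenate(([0], np.cumsum(np.bincount(labels))))
    # A stable argsort groups positions by label, keeping them ascending per label.
    sorted_indices = np.argsort(labels, kind="stable")
    return sorted_indices, offsets

labels = np.array([2, 0, 2, 1, 0])
sorted_indices, offsets = counting_layout(labels)

# Positions that hold label 2: the slice between two consecutive offsets.
idx_of_2 = sorted_indices[offsets[2]:offsets[3]]
print(idx_of_2)
```

Storing one flat index array plus offsets avoids allocating a separate array per label, which matters when there are many labels.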

User

Answer 3 · edited 2024-04-24 19:29:01

If you use a dictionary to store the indexes as you traverse the array, you only need to traverse it once:

from collections import defaultdict

def get_key_to_indexes_ddict(labels):
    # Each new label starts with an empty list of indexes
    indexes = defaultdict(list)
    for index, label in enumerate(labels):
        indexes[label].append(index)
    return indexes

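The scaling looks much like what you analysed for your own options. For the function above it is O(N), where N is the size of y (the label array), since a dictionary lookup is O(1).

Interestingly, because np.where traverses the array so much faster, your function is faster as long as there are only a few distinct labels. Mine seems to be faster when there are many distinct labels.

Here is how the functions scale:

The blue lines are your function, the red lines are mine. The line style indicates the number of distinct labels: {10: ':', 100: '--', 1000: '-.', 10000: '-'}. You can see that my function is relatively independent of the number of labels, while yours quickly becomes slow when there are many labels. If you have few labels, you are better off with yours.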
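A comparison like this can be reproduced with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the script behind the plot. The function names, array size and label counts are arbitrary choices, and both functions are slight variations on the ones in this thread (they convert to Python ints with tolist() first, and the np.where variant keeps only the index array).

```python
import timeit
from collections import defaultdict

import numpy as np

def with_where(labels):
    # One full np.where scan per distinct label: O(m * n) overall.
    return {label: np.where(labels == label)[0] for label in set(labels.tolist())}

def with_defaultdict(labels):
    # Single pass over the labels in pure Python: O(n) overall.
    indexes = defaultdict(list)
    for index, label in enumerate(labels.tolist()):
        indexes[label].append(index)
    return indexes

rng = np.random.default_rng(0)
for n_labels in (10, 1000):
    labels = rng.integers(0, n_labels, size=20_000)
    t_where = timeit.timeit(lambda: with_where(labels), number=3)
    t_dict = timeit.timeit(lambda: with_defaultdict(labels), number=3)
    print(f"{n_labels:>5} labels: np.where {t_where:.3f}s, defaultdict {t_dict:.3f}s")
```

Absolute timings depend on the machine and the NumPy version, so only the trend across label counts is worth comparing.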
