Fastest way to get the 10 nearest Euclidean neighbors of large feature vectors in Python

2 answers

User

#1 · edited 2024-04-18 16:43:14

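import numpy as np

def topTen(M):
    # All index pairs (i, j) with i < j, i.e. every unordered pair of rows
    i, j = np.triu_indices(M.shape[0], 1)
    # Squared Euclidean distance for every pair
    dist_sq = np.einsum('ij,ij->i', M[i]-M[j], M[i]-M[j])
    # Positions of the 10 smallest squared distances (unordered), then sorted
    max_i = np.argpartition(dist_sq, 10)[:10]
    max_o = np.argsort(dist_sq[max_i])
    # Columns of the result: row index i, row index j, distance
    return np.vstack((i[max_i][max_o], j[max_i][max_o], dist_sq[max_i][max_o]**.5)).T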

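This should be fairly fast, because the sort and the square root, which are the slow steps, are applied only to the top 10 (outside of the loop).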
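As a rough illustration of the idea above, here is a minimal sketch that generalizes the function to an arbitrary `k` and checks it against a plain loop. The names `top_k_pairs` and `brute_force_pairs` are made up for this example and are not part of the original answer; the sketch assumes `M` has at least two rows and `k >= 1`.

```python
import itertools
import numpy as np

def top_k_pairs(M, k=10):
    """Hypothetical generalization of topTen: the k closest row pairs of M."""
    # Every unordered pair of rows (i < j)
    i, j = np.triu_indices(M.shape[0], 1)
    diff = M[i] - M[j]
    dist_sq = np.einsum('ij,ij->i', diff, diff)
    k = min(k, dist_sq.size)
    # argpartition requires kth < size, so partition on k - 1;
    # the first k positions then hold the k smallest values, unordered
    part = np.argpartition(dist_sq, k - 1)[:k]
    # Sort only those k candidates, and take the square root only for them
    order = part[np.argsort(dist_sq[part])]
    return i[order], j[order], np.sqrt(dist_sq[order])

def brute_force_pairs(M, k=10):
    """Reference implementation with an explicit loop, for checking only."""
    pairs = [(np.linalg.norm(M[a] - M[b]), a, b)
             for a, b in itertools.combinations(range(M.shape[0]), 2)]
    return sorted(pairs)[:k]

rng = np.random.default_rng(0)
M = rng.random((50, 30))
i, j, d = top_k_pairs(M, k=10)
ref = brute_force_pairs(M, k=10)
print(np.allclose(d, [r[0] for r in ref]))
```

Note that `M[i] - M[j]` still materializes one row per pair, so memory grows with the square of the number of rows, just as in the original function.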

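User

#2 · edited 2024-04-18 16:43:14

I'm posting this as an answer, but I admit it is not a real solution to the problem, because it only works for smaller arrays. The issue is that if you want to be fast and avoid loops, you need to compute all pairwise distances at once, which means memory usage grows with the square of the number of rows. Assuming 10,000 rows × 10,000 rows × 3,000 elements per row × 4 bytes per element (with float32), that is about 1.2 × 10^12 bytes, i.e. on the order of 1 TB (!) of memory (and in practice possibly twice that, since you may need two arrays of the same size). So while it is possible, it is not practical at this size. The code below shows how to do it (with the sizes divided by 100).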

import numpy as np

# Row length
n = 30
# Number of rows
m = 100
# Number of top elements
k = 10

# Input data
data = np.random.random((m, n))
# Tile the data in two different dimensions
data1 = np.tile(data[:, :, np.newaxis], (1, 1, m))
data2 = np.tile(data.T[np.newaxis, :, :], (m, 1, 1))
# Compute pairwise squared distances
dist = np.sum(np.square(data1 - data2), axis=1)
# Fill lower half with inf to avoid repeat and self-matching
dist[np.tril_indices(m)] = np.inf
# Find smallest distance for each row
i = np.arange(m)
j = np.argmin(dist, axis=1)
dmin = dist[i, j]
# Pick the top K smallest of the per-row minima (at most one pair per row)
idx = np.stack((i, j), axis=1)
isort = dmin.argsort()

# Top K indices pairs (K x 2 matrix)
top_idx = idx[isort[:k], :]
# Top K smallest distances
top_dist = np.sqrt(dmin[isort[:k]])
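
One common way to shrink the intermediate arrays is the identity ||a - b||² = ||a||² + ||b||² - 2 a·b, which needs only an m × m matrix instead of the tiled m × n × m arrays. The sketch below uses that identity, which is a different technique from the tiling shown above, and it selects the k smallest distances over all pairs rather than one per row. The function name `pairwise_sq_dists` is made up for this example.

```python
import numpy as np

def pairwise_sq_dists(data):
    """Squared Euclidean distances between all rows, as an (m, m) matrix."""
    # Squared norm of each row
    sq_norms = np.einsum('ij,ij->i', data, data)
    # Dot product of every row with every other row
    gram = data @ data.T
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Rounding can leave tiny negative values; clip them to zero
    return np.maximum(dist, 0.0)

rng = np.random.default_rng(0)
m, n, k = 100, 30, 10
data = rng.random((m, n))

dist = pairwise_sq_dists(data)
# Keep only pairs with i < j, so self-matches and duplicates are excluded
iu, ju = np.triu_indices(m, 1)
upper = dist[iu, ju]
# k smallest over all pairs, then sorted in ascending order
part = np.argpartition(upper, k - 1)[:k]
order = part[np.argsort(upper[part])]
# Top K index pairs (K x 2 matrix) and their distances
top_idx = np.stack((iu[order], ju[order]), axis=1)
top_dist = np.sqrt(upper[order])
print(top_idx.shape, top_dist.shape)
```

For 10,000 rows the distance matrix alone would be about 400 MB in float32, independent of the row length, although the index arrays from `np.triu_indices` add to that. The subtraction can lose precision when two rows are nearly identical, so this is a trade-off rather than a drop-in replacement.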
