Multiplying the number of distances in a distance matrix before histogram binning

Question

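I am using scipy.spatial.distance.pdist to compute the distances between a set of coordinates, and then numpy.histogram to bin the results. At the moment this treats each coordinate as a single object, but I actually have multiple objects at the same coordinate. One solution is to change the array so that each coordinate appears multiple times, once per object. However, this greatly increases the size of the array and the run time of pdist, since its cost scales as N^2. That is very expensive, and speed matters in my application.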

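A second approach is to process the resulting distance matrix so that each distance is repeated ni*nj times, where ni is the number of objects at coordinate i and nj is the number of objects at coordinate j. This would turn the original MxM distance matrix into an NxN distance matrix, where M is the total number of coordinates in the array and N is the total number of objects. This also seems needlessly costly, because all I really need is to tell the histogram function to multiply the number of events at distance ij by ni*nj. In other words, is there a way to tell numpy.histogram that there is not just one object at distance ij, but ni*nj objects?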

Other ideas are of course welcome.

Edit:

Here is an example of the first approach.

    import numpy as np
    from scipy.spatial import distance
    import matplotlib.pyplot as plt

    # create array of 5 coordinates in 3D
    coords = np.random.random(15).reshape(5, 3)
    '''array([[ 0.66500534, 0.10145476, 0.92528492],
              [ 0.52677892, 0.07756804, 0.50976737],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.02707651, 0.21878467, 0.55855427],
              [ 0.81564621, 0.82750694, 0.53083443]])'''

    # number of objects at each coordinate
    objects = np.random.randint(1, 10, 5)
    # array([5, 3, 8, 5, 1])

    # create new array with coordinates for each individual object
    new_coords = np.zeros((objects.sum(), 3))

    # there's surely a simpler way to do this
    j = 0
    for coord in range(coords.shape[0]):
        for i in range(objects[coord]):
            new_coords[j] = coords[coord]
            j += 1
    '''new_coords
       array([[ 0.66500534, 0.10145476, 0.92528492],
              [ 0.66500534, 0.10145476, 0.92528492],
              [ 0.66500534, 0.10145476, 0.92528492],
              [ 0.66500534, 0.10145476, 0.92528492],
              [ 0.66500534, 0.10145476, 0.92528492],
              [ 0.52677892, 0.07756804, 0.50976737],
              [ 0.52677892, 0.07756804, 0.50976737],
              [ 0.52677892, 0.07756804, 0.50976737],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.50030508, 0.37635556, 0.20828815],
              [ 0.02707651, 0.21878467, 0.55855427],
              [ 0.02707651, 0.21878467, 0.55855427],
              [ 0.02707651, 0.21878467, 0.55855427],
              [ 0.02707651, 0.21878467, 0.55855427],
              [ 0.02707651, 0.21878467, 0.55855427],
              [ 0.81564621, 0.82750694, 0.53083443]])'''

    # calculate distance matrix of old and new arrays
    distances_old = distance.pdist(coords)
    distances_new = distance.pdist(new_coords)

    # calculate and plot normalized histograms
    # (typically just use np.histogram without plotting)
    # density=True is the current spelling of the older normed=True argument
    plt.hist(distances_old, range=(0, 1), alpha=.5, density=True)
    # (array([ 0., 0., 0., 0., 2., 1., 2., 2., 2., 1.]),
    #  array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
    #  <a list of 10 Patch objects>)
    plt.hist(distances_new, range=(0, 1), alpha=.5, density=True)
    # (array([ 2.20779221, 0. , 0. , 0. , 1.68831169, 0.64935065,
    #          2.07792208, 2.81385281, 0.34632035, 0.21645022]),
    #  array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
    #  <a list of 10 Patch objects>)
    plt.show()
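As a side note on the "simpler way" comment in the example: the nested loop that builds the per-object array can probably be replaced by a single call to np.repeat. The sketch below uses small made-up inputs (the values of coords and objects here are illustrative assumptions, not the data above) and only shows the expansion step. It does not remove the N^2 cost of the first approach.

```python
import numpy as np

# Hypothetical small inputs, chosen only for illustration
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
objects = np.array([2, 1, 3])  # number of objects at each coordinate

# Repeat row i of coords objects[i] times along axis 0
new_coords = np.repeat(coords, objects, axis=0)

print(new_coords.shape)
```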

The second approach would process the distance matrix rather than the coordinate matrix, but I have not worked out that code yet.
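For reference, one possible sketch of that second approach is given here, under some assumptions: it relies on pdist returning the condensed distances in the same pair order as np.triu_indices(M, k=1), and it uses small made-up inputs. Note that the expanded array only matches the per-object result if the zero distances between objects sharing a coordinate are added back, ni*(ni-1)/2 of them per coordinate. This still allocates one entry per object pair, so it illustrates the idea rather than solving the cost problem.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical small inputs, chosen only for illustration
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
objects = np.array([2, 1, 3])  # number of objects at each coordinate

# Condensed distances between distinct coordinates: pairs (0,1), (0,2), (1,2)
d = distance.pdist(coords)

# Index pairs (i, j) with i < j, in the same order as the condensed vector
i, j = np.triu_indices(len(coords), k=1)
pair_counts = objects[i] * objects[j]  # ni * nj object pairs per distance

# Repeat each distance ni * nj times
expanded = np.repeat(d, pair_counts)

# Objects sharing a coordinate are at distance 0 from each other:
# ni * (ni - 1) / 2 such pairs per coordinate
same_site = int((objects * (objects - 1) // 2).sum())
expanded = np.concatenate([np.zeros(same_site), expanded])

# Sanity check against the brute-force per-object computation
brute = distance.pdist(np.repeat(coords, objects, axis=0))
assert np.allclose(np.sort(expanded), np.sort(brute))
```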

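Both approaches seem inefficient to me. I think adjusting the binning step of np.histogram would be more efficient, since it is just basic multiplication, but I am not sure how to tell np.histogram to count a variable number of objects at each coordinate.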
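One direction that may fit this description is the weights argument of np.histogram, which lets each value contribute an arbitrary amount to its bin instead of 1. The sketch below is an assumption about how it could be applied here, not a tested solution for the real data: the function name weighted_distance_histogram and its parameters are made up for illustration, and it assumes pdist orders pairs the same way as np.triu_indices(M, k=1). Pairs of objects at the same coordinate are added as a single zero distance carrying their total count, which is what produces the non-empty first bin in the per-object histogram.

```python
import numpy as np
from scipy.spatial import distance


def weighted_distance_histogram(coords, objects, bins=10, value_range=(0, 1),
                                include_same_site=True, density=False):
    """Histogram of pairwise object distances without expanding the arrays.

    Hypothetical helper: each coordinate-pair distance is weighted by
    ni * nj instead of being repeated ni * nj times.
    """
    coords = np.asarray(coords, dtype=float)
    objects = np.asarray(objects)

    d = distance.pdist(coords)                # M*(M-1)/2 distances
    i, j = np.triu_indices(len(coords), k=1)  # same pair order as pdist
    w = objects[i] * objects[j]               # ni * nj per distance

    if include_same_site:
        # All same-coordinate pairs sit at distance 0; add them as one
        # value whose weight is the total number of such pairs.
        d = np.append(d, 0.0)
        w = np.append(w, (objects * (objects - 1) // 2).sum())

    return np.histogram(d, bins=bins, range=value_range,
                        weights=w, density=density)


# Usage with small made-up inputs
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
objects = np.array([2, 1, 3])
counts, edges = weighted_distance_histogram(coords, objects,
                                            bins=3, value_range=(0, 3))
print(counts, edges)
```

With this layout pdist still runs on the M coordinates only, and the extra work is one multiplication per coordinate pair. Passing include_same_site=False would drop the zero-distance pairs if only distances between different coordinates are of interest.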


