Numpy:从点列表中提取坐标范围相同的点,避免使用for循环
我遇到了一个关于numpy的挑战。
我有一组三维点的列表。每个点的前两个值是它在图像上的坐标,第三个值是与这个点相关的一些数值。
我想在图像上放一个二维网格,并建立一个数组,里面包含每个小块的平均值,这样每个小块都有一个平均值。
我已经用numpy实现了这个功能,但为了优化(其实我并不太需要,但现在我很好奇是否可以做到),我想找一种方法来去掉下面代码中的最后两个for循环,完全用numpy来实现。我觉得可以用np.mgrid,但我还没找到具体怎么用。你能帮我吗?
import numpy as np
points_range = np.array([2, 5, 1])
points = np.random.random((100, 3)) # points[:, i] will be referred as x, y, z for i in 0, 1, 2
points *= points_range
x_steps = 10 # We will divide the points x space in 10 columns
y_steps = 15 # We will divide the points y space in 15 rows
x_step_size = points_range[0] / x_steps
y_step_size = points_range[1] / y_steps
# Do something with that ???
X, Y = np.mgrid[0:points_range[0]:x_step_size, 0:points_range[1]:y_step_size]
# Try the same with numpy: WORK
means = [[0 for _ in range(y_steps)] for _ in range(x_steps)]
for x_step_id in range(x_steps):
for y_step_id in range(y_steps):
# Compute the z average for every points in the tile [x_step_id, y_step_id]
# -> get points that are inside this tile
p = points[(points[:, 0] >= x_step_size * x_step_id) &
(points[:, 0] < x_step_size * (x_step_id + 1)) &
(points[:, 1] >= y_step_size * y_step_id) &
(points[:, 1] < y_step_size * (y_step_id + 1))]
# -> get z average on these points
mean_z = np.mean(p[:, 2])
means[x_step_id][y_step_id] = 0 if np.isnan(mean_z) else mean_z
解决方案
这段代码比较了三种版本及其运行时间。
在我电脑上运行这段代码的输出是(时间单位为秒,1000万个点,15*15的网格):
Method for_loop_1 took 37.779709815979004
Method for_loop_2 took 14.891589879989624
Method for_plus_numpy took 16.47166681289673
Method full_numpy took 1.1438612937927246
代码:
import numpy as np
import time
points_range = np.array([2, 5, 1])
points = np.random.random((10000000, 3)) # points[:, i] will be referred as x, y, z for i in 0, 1, 2
points *= points_range
x_steps = 15 # We will divide the points x space in 10 columns
y_steps = 15 # We will divide the points y space in 15 rows
x_step_size = points_range[0] / x_steps
y_step_size = points_range[1] / y_steps
methods = []
def for_loop_1():
start = time.time()
means = [[0 for _ in range(y_steps)] for _ in range(x_steps)]
n = np.zeros_like(means)
for p in points:
tile_x = int(p[0] // x_step_size)
tile_y = int(p[1] // y_step_size)
means[tile_x][tile_y] = (p[2] + n[tile_x][tile_y] * means[tile_x][tile_y]) / (n[tile_x][tile_y] + 1)
n[tile_x][tile_y] += 1
stop = time.time()
return means, stop - start
methods.append(for_loop_1)
def for_loop_2():
start = time.time()
sums = np.zeros((x_steps, y_steps))
counts = np.zeros((x_steps, y_steps))
for point in points:
tile_index_x = int(point[0] // x_step_size)
tile_index_y = int(point[1] // y_step_size)
sums[tile_index_x, tile_index_y] += point[2]
counts[tile_index_x, tile_index_y] += 1
means = np.where(counts > 0, sums / counts, 0)
stop = time.time()
return means, stop - start
methods.append(for_loop_2)
# Try the same with numpy
def for_plus_numpy(): # My method
start = time.time()
means = [[0 for _ in range(y_steps)] for _ in range(x_steps)]
for x_step_id in range(x_steps):
for y_step_id in range(y_steps):
# Compute the z average for every points in the tile [x_step_id, y_step_id]
# -> get points that are inside this tile
p = points[(points[:, 0] >= x_step_size * x_step_id) &
(points[:, 0] < x_step_size * (x_step_id + 1)) &
(points[:, 1] >= y_step_size * y_step_id) &
(points[:, 1] < y_step_size * (y_step_id + 1))]
# -> get z average on these points
mean_z = np.mean(p[:, 2])
means[x_step_id][y_step_id] = 0 if np.isnan(mean_z) else mean_z
stop = time.time()
return means, stop - start
methods.append(for_plus_numpy)
def full_numpy():
start = time.time()
edges = np.linspace(0, points_range[0], x_steps+1), np.linspace(0, points_range[1], y_steps+1)
sums, _, _ = np.histogram2d(points[:, 0], points[:, 1], edges, weights=points[:, 2])
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], edges)
means = np.where(counts > 0, sums / counts, 0)
stop = time.time()
return means, stop - start
methods.append(full_numpy)
for method in methods:
means, duration = method()
print("Method ", method.__name__ + " took ", duration, sep="")
2 个回答
1
与其对每个方块都遍历整个列表,不如一次性遍历所有的点。对于每个点,检查它属于哪个方块,然后通过计算累积平均值来更新相应的平均值。这意味着,你只需要遍历一次列表。
import numpy as np
points_range = np.array([2, 5, 1])
points = np.random.random((100, 3)) # points[:, i] will be referred as x, y, z for i in 0, 1, 2
points *= points_range
x_steps = 10 # We will divide the points x space in 10 columns
y_steps = 15 # We will divide the points y space in 15 rows
x_step_size = points_range[0] / x_steps
y_step_size = points_range[1] / y_steps
means = np.zeros((x_steps, y_steps))
# keep track of current number of points per tile for running average
points_per_tile = np.zeros_like(means)
for point in points:
# check which tile the points belongs to
tile_index_x = int(point[0] // x_step_size)
tile_index_y = int(point[1] // y_step_size)
# get current mean and current N
current_mean = means[tile_index_x, tile_index_y]
current_points = points_per_tile[tile_index_x, tile_index_y]
# get updated mean
updated_mean = (current_mean * current_points + point[2]) / (current_points + 1)
# update values
means[tile_index_x, tile_index_y] = updated_mean
points_per_tile[tile_index_x, tile_index_y] += 1
建议修改:
(通过这个修改,效率和原作者在100万点上的解决方案相似)
# keep track of current sum of points per tile for running average
sums = np.zeros((x_steps, y_steps))
# keep track of current number of points per tile for running average
counts = np.zeros((x_steps, y_steps))
for point in points:
# check which tile the points belongs to
tile_index_x = int(point[0] // x_step_size)
tile_index_y = int(point[1] // y_step_size)
# update
sums[tile_index_x, tile_index_y] += point[2]
counts[tile_index_x, tile_index_y] += 1
means = np.where(counts>0, sums/counts, 0)
2
这里有另一种方法,它更简洁,也更高效(在处理100万个点时大约快10倍)
edges = np.linspace(0,points_range[0],x_steps+1), np.linspace(0,points_range[1],y_steps+1)
sums, _, _ = np.histogram2d(points[:,0], points[:,1], edges, weights=points[:,2])
counts, _, _ = np.histogram2d(points[:,0], points[:,1], edges)
means = np.where(counts>0, sums/counts, 0)