通过求和聚合Numpy数组
我有一个形状为 (4320,8640)
的Numpy数组。我想把它变成一个形状为 (2160,4320)
的数组。
你会发现,新数组中的每个单元格对应旧数组中的一个2x2的单元格块。我希望新数组中的单元格值是旧数组中这个块的所有值的总和。
我可以这样来实现:
import numpy
#Generate an example array
arr = numpy.random.randint(10,size=(4320,8640))
#Perform the transformation
arrtrans = numpy.array([ [ arr[y][x]+arr[y+1][x]+arr[y][x+1]+arr[y+1][x+1] for x in range(0,8640,2)] for y in range(0,4320,2)])
不过这样做速度比较慢,而且看起来不太好。
有没有更好的方法可以用Numpy(或者其他兼容的库)来实现这个呢?
3 个回答
1
你正在对原始数组进行滑动窗口操作。关于滑动窗口、numpy和python的问题和答案在StackOverflow上有很多。通过调整数组的步幅,这个过程可以大大加快。这里有一个通用的函数,可以返回带或不带重叠的(x,y)窗口。使用这个步幅技巧似乎比@mskimm的解决方案稍微慢一点,但在你的工具箱里有这样一个功能是很不错的。这个函数不是我写的,是在Efficient Overlapping Windows with Numpy上找到的。
import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product
def norm_shape(shape):
'''
Normalize numpy array shapes so they're always expressed as a tuple,
even for one-dimensional shapes.
Parameters
shape - an int, or a tuple of ints
Returns
a shape tuple
from http://www.johnvinyard.com/blog/?p=268
'''
try:
i = int(shape)
return (i,)
except TypeError:
# shape was not a number
pass
try:
t = tuple(shape)
return t
except TypeError:
# shape was not iterable
pass
raise TypeError('shape must be an int, or a tuple of ints')
def sliding_window(a,ws,ss = None,flatten = True):
'''
Return a sliding window over a in any number of dimensions
Parameters:
a - an n-dimensional numpy array
ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size
of each dimension of the window
ss - an int (a is 1D) or tuple (a is 2D or greater) representing the
amount to slide the window in each dimension. If not specified, it
defaults to ws.
flatten - if True, all slices are flattened, otherwise, there is an
extra dimension for each dimension of the input.
Returns
an array containing each n-dimensional window from a
from http://www.johnvinyard.com/blog/?p=268
'''
if None is ss:
# ss was not provided. the windows will not overlap in any direction.
ss = ws
ws = norm_shape(ws)
ss = norm_shape(ss)
# convert ws, ss, and a.shape to numpy arrays so that we can do math in every
# dimension at once.
ws = np.array(ws)
ss = np.array(ss)
shape = np.array(a.shape)
# ensure that ws, ss, and a.shape all have the same number of dimensions
ls = [len(shape),len(ws),len(ss)]
if 1 != len(set(ls)):
error_string = 'a.shape, ws and ss must all have the same length. They were{}'
raise ValueError(error_string.format(str(ls)))
# ensure that ws is smaller than a in every dimension
if np.any(ws > shape):
error_string = 'ws cannot be larger than a in any dimension. a.shape was {} and ws was {}'
raise ValueError(error_string.format(str(a.shape),str(ws)))
# how many slices will there be in each dimension?
newshape = norm_shape(((shape - ws) // ss) + 1)
# the shape of the strided array will be the number of slices in each dimension
# plus the shape of the window (tuple addition)
newshape += norm_shape(ws)
# the strides tuple will be the array's strides multiplied by step size, plus
# the array's strides (tuple addition)
newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
strided = ast(a,shape = newshape,strides = newstrides)
if not flatten:
return strided
# Collapse strided so that it has one more dimension than the window. I.e.,
# the new array is a flat list of slices.
meat = len(ws) if ws.shape else 0
firstdim = (np.product(newshape[:-meat]),) if ws.shape else ()
dim = firstdim + (newshape[-meat:])
# remove any dimensions with size 1
dim = filter(lambda i : i != 1,dim)
return strided.reshape(dim)
用法:
# 2x2 windows with NO overlap
b = sliding_window(arr, (2,2), flatten = False)
c = b.sum((1,2))
使用numpy.einsum
可以大约提高24%的性能。
c = np.einsum('ijkl -> ij', b)
有一个StackOverflow的问答例子如何像Matlab的blkproc(blockproc)函数一样高效处理numpy数组,选中的答案对你会有帮助。
9
当窗口的大小正好适合数组时,使用 np.sum
将数组重新调整为更多的维度,并把多余的维度合并起来,这种做法在使用numpy时是比较标准的方式:
>>> a = np.random.rand(4320,8640)
>>> a.shape
(4320, 8640)
>>> a_small = a.reshape(2160, 2, 4320, 2).sum(axis=(1, 3))
>>> a_small.shape
(2160, 4320)
>>> np.allclose(a_small[100, 203], a[200:202, 406:408].sum())
True
1
我不确定你想要的那个包是否存在,但这段代码运行起来会快很多。
>>> arrtrans2 = arr[::2, ::2] + arr[::2, 1::2] + arr[1::2, ::2] + arr[1::2, 1::2]
>>> numpy.allclose(arrtrans, arrtrans2)
True
这里的 ::2
和 1::2
分别表示的是 0, 2, 4, ...
和 1, 3, 5, ...
这样的数字序列。