在Python中动态增大大小的Numpy数组

Question

我最近遇到了在Python中需要一个增量的Numpy数组的情况，因为找不到现成的解决方案，所以我自己实现了一个。我在想，我的方法是否是最好的，或者你们有没有其他的想法。

问题是，我有一个二维数组（这个程序可以处理多维数组），但它的大小事先并不知道，而且需要在一个方向上不断地往数组里添加数据（比如说，我需要多次调用np.vstack）。每次添加数据时，我需要取出数组，沿着第0轴进行排序，还要做其他操作，所以我不能先构建一个很长的数组列表，然后一次性用np.vstack把这个列表合并起来。由于内存分配是比较耗费资源的，我选择了增量数组的方法，也就是每次增加一个比我实际需要的大小更大的量（我使用50%的增量），这样可以减少内存分配的次数。

我写了这个代码，你可以在下面的代码块中看到：

class ExpandingArray:

    __DEFAULT_ALLOC_INIT_DIM = 10   # default initial dimension for all the axis is nothing is given by the user
    __DEFAULT_MAX_INCREMENT = 10    # default value in order to limit the increment of memory allocation

    __MAX_INCREMENT = []    # Max increment
    __ALLOC_DIMS = []       # Dimensions of the allocated np.array
    __DIMS = []             # Dimensions of the view with data on the allocated np.array (__DIMS <= __ALLOC_DIMS)

    __ARRAY = []            # Allocated array

    def __init__(self,initData,allocInitDim=None,dtype=np.float64,maxIncrement=None):
        self.__DIMS = np.array(initData.shape)

        self.__MAX_INCREMENT = maxIncrement
        if self.__MAX_INCREMENT == None:
            self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT

        # Compute the allocation dimensions based on user's input
        if allocInitDim == None:
            allocInitDim = self.__DIMS.copy()

        while np.any( allocInitDim < self.__DIMS  ) or np.any(allocInitDim == 0):
            for i in range(len(self.__DIMS)):
                if allocInitDim[i] == 0:
                    allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                if allocInitDim[i] < self.__DIMS[i]:
                    allocInitDim[i] += min(allocInitDim[i]/2, self.__MAX_INCREMENT)

        # Allocate memory 
        self.__ALLOC_DIMS = allocInitDim
        self.__ARRAY = np.zeros(self.__ALLOC_DIMS,dtype=dtype)

        # Set initData 
        sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
        self.__ARRAY[sliceIdxs] = initData

    def shape(self):
        return tuple(self.__DIMS)

    def getAllocArray(self):
        return self.__ARRAY

    def getDataArray(self):
        """
        Get the view of the array with data
        """
        sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
        return self.__ARRAY[sliceIdxs]

    def concatenate(self,X,axis=0):
        if axis > len(self.__DIMS):
            print "Error: axis number exceed the number of dimensions"
            return

        # Check dimensions for remaining axis 
        for i in range(len(self.__DIMS)):
            if i != axis:
                if X.shape[i] != self.shape()[i]:
                    print "Error: Dimensions of the input array are not consistent in the axis %d" % i
                    return

        # Check whether allocated memory is enough 
        needAlloc = False
        while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
            needAlloc = True
            # Increase the __ALLOC_DIMS 
            self.__ALLOC_DIMS[axis] += min(self.__ALLOC_DIMS[axis]/2,self.__MAX_INCREMENT)

        # Reallocate memory and copy old data 
        if needAlloc:
            # Allocate 
            newArray = np.zeros(self.__ALLOC_DIMS)
            # Copy 
            sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
            newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
            self.__ARRAY = newArray

        # Concatenate new data 
        sliceIdxs = []
        for i in range(len(self.__DIMS)):
            if i != axis:
                sliceIdxs.append(slice(self.__DIMS[i]))
            else:
                sliceIdxs.append(slice(self.__DIMS[i],self.__DIMS[i]+X.shape[i]))

        self.__ARRAY[sliceIdxs] = X
        self.__DIMS[axis] += X.shape[axis]

这个代码的性能明显优于使用vstack/hstack进行多次随机大小的合并。

我想知道的是：这样做是最好的方法吗？Numpy中有没有现成的解决方案可以做到这一点？

另外，如果能重载np.array的切片赋值操作就好了，这样一旦用户在实际维度之外赋值，就会自动调用ExpandingArray.concatenate()。请问怎么实现这样的重载呢？

测试代码：我在这里也贴出了一些我用来比较vstack和我自己方法的代码。我添加了最大长度为100的随机数据块。

import time

N = 10000

def performEA(N):
    EA = ExpandingArray(np.zeros((0,2)),maxIncrement=1000)
    for i in range(N):
        nNew = np.random.random_integers(low=1,high=100,size=1)
        X = np.random.rand(nNew,2)
        EA.concatenate(X,axis=0)
        # Perform operations on EA.getDataArray()
    return EA

def performVStack(N):
    A = np.zeros((0,2))
    for i in range(N):
        nNew = np.random.random_integers(low=1,high=100,size=1)
        X = np.random.rand(nNew,2)
        A = np.vstack((A,X))
        # Perform operations on A
    return A

start_EA = time.clock()
EA = performEA(N)
stop_EA = time.clock()

start_VS = time.clock()
VS = performVStack(N)
stop_VS = time.clock()

print "Elapsed Time EA: %.2f" % (stop_EA-start_EA)
print "Elapsed Time VS: %.2f" % (stop_VS-start_VS)

性能优化内存管理 numpy 多维数组数据合并数组操作切片赋值增量数组

在Python中动态增大大小的Numpy数组

2 个回答

撰写回答