Incremental PCA on big data

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

1 answer

User

#1 · Posted 2024-04-19 15:24:08

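Your program is probably failing while trying to load the entire dataset into RAM. At 32 bits (4 bytes) per float32 value, 1,000,000 × 1,000 values come to about 3.7 GiB, which can be a problem on a machine with only 4 GiB of RAM. To check whether this is really the cause, try creating an array of that size on its own: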

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM or need to process your dataset one chunk at a time.
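The 3.7 GiB figure can also be checked without allocating anything. This is only a quick sketch of the arithmetic, assuming the 1,000,000 × 1,000 float32 shape mentioned earlier; the variable names are mine.

```python
import numpy as np

shape = (1_000_000, 1_000)                # assumed dataset shape (rows, columns)
itemsize = np.dtype(np.float32).itemsize  # bytes per float32 value

# Total size of a dense array of that shape, in bytes and in GiB
n_bytes = shape[0] * shape[1] * itemsize
gib = n_bytes / 2**30

print(f"{n_bytes} bytes = {gib:.2f} GiB")
```

If the number printed is close to, or above, the RAM you have free, loading the whole dataset at once is unlikely to work.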

With h5py datasets, we should simply avoid passing the entire dataset to our methods and pass slices of the dataset instead, one at a time.

As I don't have your data, let me start by creating a random dataset of the same size:

import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000,1000), dtype=np.float32)
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()

This creates a nice 3.8 GiB file.

Now, if we are on Linux, we can limit how much memory is available to our program:

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now if we try to run your code, we get the MemoryError. (Press Ctrl-D later to quit the new bash session and reset the limit.)

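Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.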

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time; must divide n evenly
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
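The loop above assumes that chunk_size divides n evenly. If it does not, something along these lines might work. It is only a sketch: iter_row_slices is a hypothetical helper of mine, not part of scikit-learn or h5py, and a small random in-memory array stands in for h5['data'] so that the example runs without the big file. A tail shorter than n_components is merged into the previous chunk, because partial_fit can reject a batch with fewer rows than components.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def iter_row_slices(n_rows, chunk_size, min_rows=1):
    """Yield slices covering range(n_rows) in order.

    Hypothetical helper: a tail shorter than min_rows is merged
    into the preceding chunk instead of being yielded on its own.
    """
    start = 0
    while start < n_rows:
        stop = min(start + chunk_size, n_rows)
        if 0 < n_rows - stop < min_rows:
            stop = n_rows  # absorb the short tail
        yield slice(start, stop)
        start = stop


# Stand-in for h5['data']; an h5py dataset can be sliced the same way.
rng = np.random.default_rng(0)
data = rng.random((1050, 20), dtype=np.float32)

n_components = 5
ipca = IncrementalPCA(n_components=n_components)

# First pass: fit on one chunk at a time.
for rows in iter_row_slices(data.shape[0], 100, min_rows=n_components):
    ipca.partial_fit(data[rows])

# Second pass: project chunk by chunk; only the reduced rows are kept.
reduced = np.vstack([
    ipca.transform(data[rows])
    for rows in iter_row_slices(data.shape[0], 100)
])
print(reduced.shape)
```

The second pass keeps only n_components columns per row, so the projected result is far smaller than the input and can usually be held in memory even when the original dataset cannot.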
