Save / load scipy sparse csr_matrix in a portable data format

Asked 2025-04-17 10:47

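How can I save and load a scipy sparse csr_matrix in a portable format? The matrix was created on Python 3 (Windows 64-bit), but I want to use it on Python 2 (Linux 64-bit). Initially I used pickle (with protocol=2 and fix_imports=True), but that did not work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit). It failed with this error: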

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load, as well as scipy.io.mmwrite() and scipy.io.mmread(), but none of these worked either.

10 Answers

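Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a random sparse matrix of 1 million rows by 100 thousand columns with density 0.001, containing 100 million non-zero values: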

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
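
For readers who cannot fit the full benchmark matrix in memory, here is a scaled-down sketch of the same construction. The sizes below are illustrative assumptions, not the ones used in the timings; the point is only that the number of stored elements equals rows x columns x density.

```python
from scipy.sparse import random as sparse_random

# Scaled-down stand-in for the benchmark matrix (sizes are illustrative).
rows, cols, density = 1000, 100, 0.001
small = sparse_random(rows, cols, density=density, format='csr')

# scipy.sparse.random stores round(rows * cols * density) values.
expected_nnz = int(round(rows * cols * density))
print(small.shape, small.nnz, expected_nnz)
```

With the original sizes the same arithmetic gives 1,000,000 x 100,000 x 0.001 = 100,000,000 stored elements, matching the repr above.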

`io.mmwrite` / `io.mmread`

from scipy import io  # mmwrite / mmread live in scipy.io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format has changed from CSR to COO.)

`np.savez` / `np.load`

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

`cPickle`

import cPickle as pickle  # Python 2 only; on Python 3 use "import pickle"

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)
    return matrix

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

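Note: cPickle does not work with very large objects (see this answer). In my experience, it failed on a matrix of 2.7 million rows by 50 thousand columns with 270 million non-zero values. The np.savez solution worked well.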

Conclusion

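(Based on this simple test for CSR matrices.) cPickle is the fastest method, but it does not work with very large matrices. np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores to the wrong format. So np.savez is the winner here.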

Answered 2025-04-17 by Python大师

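Although you say that scipy.io.mmwrite and scipy.io.mmread did not work for you, I want to add how they work. This question is the top Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and should not be overlooked by those who have not tried them yet.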

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)

Answered 2025-04-17 by Python大师

Edit: scipy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
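
As a minimal sketch of the file-like case, the example below round-trips a small matrix through an in-memory io.BytesIO buffer instead of a file on disk. The matrix contents and the variable names are made up for illustration.

```python
import io

import numpy as np
from scipy import sparse

# Small illustrative matrix (values are arbitrary).
your_matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

buffer = io.BytesIO()
sparse.save_npz(buffer, your_matrix)  # write to an in-memory file-like object
buffer.seek(0)                        # rewind before reading back
restored = sparse.load_npz(buffer)

print(restored.format)
print(np.array_equal(restored.toarray(), your_matrix.toarray()))
```

The same pattern should apply to a file opened in binary mode with open(path, 'wb') for saving and open(path, 'rb') for loading.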

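Got an answer from the Scipy user group:

A csr_matrix has three important data attributes: .data, .indices, and .indptr. All three are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with: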

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))
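
To make the three arrays concrete, here is a small worked example on a 3x3 matrix chosen purely for illustration. It prints each attribute and then rebuilds the matrix from them using the constructor form shown above.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 6]])
m = csr_matrix(dense)

# data:    the non-zero values, stored row by row
# indices: the column index of each value in data
# indptr:  row i owns the slice data[indptr[i]:indptr[i + 1]]
print(m.data)     # [3 4 5 6]
print(m.indices)  # [2 0 1 2]
print(m.indptr)   # [0 1 2 4]

# Rebuild the matrix from the three arrays plus the shape.
rebuilt = csr_matrix((m.data, m.indices, m.indptr), shape=m.shape)
print(np.array_equal(rebuilt.toarray(), dense))  # True
```

The shape has to be stored alongside the arrays because trailing all-zero columns cannot be recovered from indices alone.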

So, for example:

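import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends '.npz' if filename does not already end with it
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the full file name here, including the '.npz' extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])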
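
A quick way to check this approach is a round trip through a temporary file. The sketch below is a self-contained variant of the two functions, with two changes that are my own assumptions rather than part of the answer: the loader is closed via a with block, and the shape is converted to a tuple. The file name and matrix size are arbitrary.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random


def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)


def load_sparse_csr(filename):
    # Close the .npz archive once the arrays have been read into memory.
    with np.load(filename) as loader:
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=tuple(loader['shape']))


original = sparse_random(50, 20, density=0.1, format='csr')

with tempfile.TemporaryDirectory() as tmpdir:
    # Use an explicit '.npz' extension so save and load see the same name.
    path = os.path.join(tmpdir, 'matrix.npz')
    save_sparse_csr(path, original)
    restored = load_sparse_csr(path)

# The element-wise comparison has no stored entries when the matrices match.
same = restored.shape == original.shape and (restored != original).nnz == 0
print(restored.format, same)
```

Passing the full name with the extension to both functions sidesteps the mismatch between np.savez, which appends '.npz' when it is missing, and np.load, which does not.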

Answered 2025-04-17 by Python大师
