遍历scipy.sparse向量(或矩阵)
我在想,使用scipy.sparse来遍历稀疏矩阵中非零元素的最佳方法是什么。例如,如果我这样做:
from scipy.sparse import lil_matrix
x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2
c = 0
for i in x:
print c, i
c = c+1
那么输出结果是
0
1
2
3
4
5
6
7
8
9
10
11
12
13 (0, 0) 1.0
14
15 (0, 0) 2.0
16
17
18
19
看起来这个迭代器会访问每一个元素,而不仅仅是非零的部分。我查看了API文档
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse/lil_matrix.html
并且搜索了一下,但似乎找不到有效的解决方案。
6 个回答
5
为了遍历来自 scipy.sparse
代码部分的各种稀疏矩阵,我会使用这个简单的包装函数(注意,对于 Python 2,建议使用 xrange
和 izip
,这样在处理大矩阵时性能会更好):
from scipy.sparse import *
def iter_spmatrix(matrix):
""" Iterator for iterating the elements in a ``scipy.sparse.*_matrix``
This will always return:
>>> (row, column, matrix-element)
Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.
Parameters
----------
matrix : ``scipy.sparse.sp_matrix``
the sparse matrix to iterate non-zero elements
"""
if isspmatrix_coo(matrix):
for r, c, m in zip(matrix.row, matrix.col, matrix.data):
yield r, c, m
elif isspmatrix_csc(matrix):
for c in range(matrix.shape[1]):
for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
yield matrix.indices[ind], c, matrix.data[ind]
elif isspmatrix_csr(matrix):
for r in range(matrix.shape[0]):
for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
yield r, matrix.indices[ind], matrix.data[ind]
elif isspmatrix_lil(matrix):
for r in range(matrix.shape[0]):
for c, d in zip(matrix.rows[r], matrix.data[r]):
yield r, c, d
else:
raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
39
最快的方法是把数据转换成一个 coo_matrix
格式:
cx = scipy.sparse.coo_matrix(x)
for i,j,v in zip(cx.row, cx.col, cx.data):
print "(%d, %d), %s" % (i,j,v)
80
编辑:bbtrb的方法(使用coo_matrix)比我最开始的建议要快很多,而我最开始的建议是使用nonzero。Sven Marnach建议使用itertools.izip
也能提高速度。目前最快的方法是using_tocoo_izip
:
import scipy.sparse
import random
import itertools
def using_nonzero(x):
rows,cols = x.nonzero()
for row,col in zip(rows,cols):
((row,col), x[row,col])
def using_coo(x):
cx = scipy.sparse.coo_matrix(x)
for i,j,v in zip(cx.row, cx.col, cx.data):
(i,j,v)
def using_tocoo(x):
cx = x.tocoo()
for i,j,v in zip(cx.row, cx.col, cx.data):
(i,j,v)
def using_tocoo_izip(x):
cx = x.tocoo()
for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
(i,j,v)
N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)
产生了这些timeit
结果:
% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop