从稀疏矩阵的行创建稀疏对角矩阵
我在Python/Scipy中处理比较大的矩阵。我需要从一个大矩阵(已经加载为coo_matrix)中提取行,并把它们用作对角线元素。目前我这样做:
import numpy as np
from scipy import sparse
def computation(A):
for i in range(A.shape[0]):
diag_elems = np.array(A[i,:].todense())
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1], format = "csc")
#...
#create some random matrix
A = (sparse.rand(1000,100000,0.02,format="csc")*5).astype(np.ubyte)
#get timings
profile.run('computation(A)')
从profile
的输出中可以看到,大部分时间都花在了get_csr_submatrix
函数上,用来提取diag_elems
。这让我觉得我可能在使用稀疏数据的表示方式上不够高效,或者提取稀疏矩阵行的方法不对。你能建议一种更好的方法来从稀疏矩阵中提取行,并以对角线的形式表示吗?
编辑
以下的变体解决了行提取的瓶颈(注意,简单地把'csc'
改成csr
是不够的,A[i,:]
也必须替换成A.getrow(i)
)。不过主要的问题是如何省略物化(.todense()
),并从行的稀疏表示中创建对角矩阵。
import numpy as np
from scipy import sparse
def computation(A):
for i in range(A.shape[0]):
diag_elems = np.array(A.getrow(i).todense())
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1], format = "csc")
#...
#create some random matrix
A = (sparse.rand(1000,100000,0.02,format="csr")*5).astype(np.ubyte)
#get timings
profile.run('computation(A)')
如果我直接从一个1行的CSR矩阵创建对角矩阵,如下所示:
diag_elems = A.getrow(i)
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1])
那么我既不能指定format="csc"
参数,也不能把ith_diags
转换成CSC格式:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/profile.py", line 70, in run
prof = prof.run(statement)
File "/usr/local/lib/python2.6/profile.py", line 456, in run
return self.runctx(cmd, dict, dict)
File "/usr/local/lib/python2.6/profile.py", line 462, in runctx
exec cmd in globals, locals
File "<string>", line 1, in <module>
File "<stdin>", line 4, in computation
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/construct.py", line 56, in spdiags
return dia_matrix((data, diags), shape=(m,n)).asformat(format)
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/base.py", line 211, in asformat
return getattr(self,'to' + format)()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/dia.py", line 173, in tocsc
return self.tocoo().tocsc()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/coo.py", line 263, in tocsc
data = np.empty(self.nnz, dtype=upcast(self.dtype))
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/sputils.py", line 47, in upcast
raise TypeError,'no supported conversion for types: %s' % args
TypeError: no supported conversion for types: object`
1 个回答
3
这是我想到的:
def computation(A):
for i in range(A.shape[0]):
idx_begin = A.indptr[i]
idx_end = A.indptr[i+1]
row_nnz = idx_end - idx_begin
diag_elems = A.data[idx_begin:idx_end]
diag_indices = A.indices[idx_begin:idx_end]
ith_diag = sparse.csc_matrix((diag_elems, (diag_indices, diag_indices)),shape=(A.shape[1], A.shape[1]))
ith_diag.eliminate_zeros()
使用Python的性能分析工具测试后,结果显示现在需要1.464秒,而之前是5.574秒。这是因为它利用了底层的稠密数组(indptr、indices、data),这些数组用来定义稀疏矩阵。简单来说,A.indptr[i]:A.indptr[i+1]表示在稠密数组中,哪些元素对应于第i行的非零值。A.data是一个包含A中所有非零值的稠密一维数组,而A.indptr则指明这些值应该放在哪一列。
我还会做更多的测试,以确保这个方法和之前的效果是一样的。我只检查了几个案例。