Efficiently building a sparse coo_matrix with NumPy, or faster list extension

3 votes

2 answers

1270 views

Asked 2025-04-18 00:11

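I have a list of 100,000 items, and each item has a set of indices. I want to put this data into a boolean sparse matrix so that I can do vector multiplication with it. My code is not running as fast as I would like, so I am looking for performance tips or for another way to get the data into the matrix.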

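import numpy as np
from scipy.sparse import coo_matrix

rows = []
cols = []
for i, item in enumerate(items):
    indices = item.getIndices()
    rows += [i]*len(indices)   # one row index per nonzero entry
    cols += indices

data = np.ones(len(rows), dtype='?')
# coo_matrix takes a (data, (rows, cols)) tuple as its first argument
mat = coo_matrix((data, (rows, cols)), shape=(len(items), totalIndices), dtype='?')
mat = mat.tocsr()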

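In the end the rows and cols lists hold 800,000 entries each, and just extending those two lists takes 16% and 13% of the build time. Converting the data to a coo_matrix takes another 12%, and the enumeration takes 13%. I got these numbers from line_profiler, and I am using Python 3.3.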

performance optimization, data conversion, sparse matrix, list extension, vector multiplication, coo_matrix, line_profiler

2 Answers

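Many sparse matrix algorithms make two passes over the data: a first pass to work out the size of the sparse matrix, and a second pass to fill in the right values. So it may be worth trying something like this: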

total_len = 0
for item in items:
    total_len += len(item.getIndices())

rows = np.empty((total_len,), dtype=np.int32)
cols = np.empty((total_len,), dtype=np.int32)

total_len = 0
for i, item in enumerate(items):
    indices = item.getIndices()
    len_ = len(indices)
    rows[total_len:total_len + len_] = i
    cols[total_len:total_len + len_] = indices
    total_len += len_

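Then carry on with what you are already doing. You could also build the CSR matrix directly and skip the COO matrix, which saves some time as well. After the first pass that determines the total size, you could do this: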

indptr = np.empty((len(items) + 1,), dtype=np.int32)
indptr[0] = 0
indices = np.empty((total_len,), dtype=np.int32)

for i, item in enumerate(items):
    item_indices = item.getIndices()
    len_ = len(item_indices)
    indptr[i+1] = indptr[i] + len_
    indices[indptr[i]:indptr[i+1]] = item_indices

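data = np.ones((total_len,), dtype=bool)
# totalIndices is the column count from the question; without shape=, SciPy infers it from the largest index
mat = csr_matrix((data, indices, indptr), shape=(len(items), totalIndices))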
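To check the direct CSR construction on a small input, here is a minimal self-contained sketch. The `Item` class is a hypothetical stand-in for the question's objects (only `getIndices()` is taken from the question), and `build_csr` / `build_coo` are names made up for this example. It builds the same matrix both ways so the two results can be compared.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


class Item:
    """Hypothetical stand-in for the question's item objects."""

    def __init__(self, indices):
        self._indices = list(indices)

    def getIndices(self):
        return self._indices


def build_csr(items, total_indices):
    # First pass: count the nonzeros so the arrays can be preallocated.
    total_len = sum(len(item.getIndices()) for item in items)
    indptr = np.empty(len(items) + 1, dtype=np.int32)
    indptr[0] = 0
    indices = np.empty(total_len, dtype=np.int32)
    # Second pass: fill the row pointers and the column indices.
    for i, item in enumerate(items):
        item_indices = item.getIndices()
        indptr[i + 1] = indptr[i] + len(item_indices)
        indices[indptr[i]:indptr[i + 1]] = item_indices
    data = np.ones(total_len, dtype=bool)
    return csr_matrix((data, indices, indptr),
                      shape=(len(items), total_indices))


def build_coo(items, total_indices):
    # Reference version: the list-based approach from the question.
    rows, cols = [], []
    for i, item in enumerate(items):
        idx = item.getIndices()
        rows += [i] * len(idx)
        cols += idx
    data = np.ones(len(rows), dtype=bool)
    coo = coo_matrix((data, (rows, cols)), shape=(len(items), total_indices))
    return coo.tocsr()


items = [Item([0, 3]), Item([2]), Item([1, 2, 4])]
direct = build_csr(items, 5)
reference = build_coo(items, 5)
```

Comparing `direct.toarray()` with `reference.toarray()` is a cheap sanity check before timing either version on the real 100,000-item list.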

Answered 2025-04-18 by Python大师


The best I could manage is:

def foo3(items,totalIndices):
    N = len(items)
    cols=[]
    cnts=[]
    for item in items:
        indices = getIndices(item)
        cols += indices
        cnts.append(len(indices))
    rows = np.arange(N).repeat(cnts) # main change
    data = np.ones(rows.shape, dtype=bool)
    mat = sparse.coo_matrix((data,(rows,cols)),shape=(N,totalIndices))
    mat = mat.tocsr()
    return mat

For 100,000 items, that is only about a 50% speedup.
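A possible variation on the same idea, offered only as a sketch: keep the per-row counts as above, but turn them into CSR row pointers with a cumulative sum and flatten the column indices with `np.fromiter` instead of growing a Python list. The function name `build_csr_flat` and the `index_lists` argument (one list of column indices per row, for example collected from `getIndices`) are assumptions for this example. Whether it actually beats `foo3` depends on the data and has not been measured here.

```python
from itertools import chain

import numpy as np
from scipy.sparse import csr_matrix


def build_csr_flat(index_lists, total_indices):
    # index_lists: one list of column indices per row (assumed to be
    # extracted beforehand from the items).
    n_rows = len(index_lists)
    counts = np.fromiter((len(lst) for lst in index_lists),
                         dtype=np.int64, count=n_rows)
    # Row pointers are the running total of the per-row counts.
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    np.cumsum(counts, out=indptr[1:])
    # Flatten all column indices in one pass into a preallocated array.
    indices = np.fromiter(chain.from_iterable(index_lists),
                          dtype=np.int64, count=int(indptr[-1]))
    data = np.ones(indices.shape[0], dtype=bool)
    return csr_matrix((data, indices, indptr),
                      shape=(n_rows, total_indices))


index_lists = [[0, 3], [], [1, 2, 4]]
mat = build_csr_flat(index_lists, 5)
dense = mat.toarray()
```

This skips both the `rows` array and the COO-to-CSR conversion, since `indptr` already encodes which row each entry belongs to.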

Answered 2025-04-18 by Python大师
