<p>The correlation coefficients can be computed fairly directly from the covariance matrix, like so:</p>
<pre><code>import numpy as np
from scipy import sparse


def sparse_corrcoef(A, B=None):
    # Rows are variables, columns are observations (as in np.corrcoef)
    if B is not None:
        A = sparse.vstack((A, B), format='csr')

    A = A.astype(np.float64)
    n = A.shape[1]

    # Compute the covariance matrix
    rowsum = A.sum(1)
    centering = rowsum.dot(rowsum.T.conjugate()) / n
    C = (A.dot(A.T.conjugate()) - centering) / (n - 1)

    # The correlation coefficients are given by
    # C_{i,j} / sqrt(C_{i,i} * C_{j,j})
    d = np.diag(C)
    coeffs = C / np.sqrt(np.outer(d, d))

    return coeffs
</code></pre>
<p>Check that it works correctly:</p>
<pre><code># some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
</code></pre>
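<p>For intuition, <code>sparse_corrcoef</code> never subtracts the row means from <code>A</code> directly, since that would turn a sparse matrix into a dense one. It relies instead on the identity that the centered product equals <code>A·Aᵀ</code> minus the outer product of the row sums divided by <code>n</code>. The following is a minimal sketch that checks this identity on a small dense array; the array size and the seed are arbitrary choices made for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 50))  # 4 variables (rows), 50 observations (columns)
n = A.shape[1]

# Explicit centering: subtract each row's mean (this would destroy sparsity).
A_centered = A - A.mean(axis=1, keepdims=True)
cov_explicit = A_centered @ A_centered.T / (n - 1)

# Implicit centering: only the row sums are needed; A itself is left untouched.
s = A.sum(axis=1, keepdims=True)
cov_implicit = (A @ A.T - s @ s.T / n) / (n - 1)

identity_holds = np.allclose(cov_explicit, cov_implicit)
matches_numpy = np.allclose(cov_implicit, np.cov(A))
print(identity_holds, matches_numpy)
```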
<h2>Note:</h2>
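<p>The amount of memory required to compute the covariance matrix <code>C</code> depends heavily on the sparsity structure of <code>A</code> (and of <code>B</code>, if given). For example, if <code>A</code> is an <code>(m, n)</code> matrix containing just a single column of non-zero values, then the product <code>A.dot(A.T)</code>, and therefore <code>C</code>, is an <code>(m, m)</code> matrix in which every entry is non-zero. If <code>m</code> is large, this could be very bad news in terms of memory consumption.</p>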
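<p>To make this concrete, here is a rough sketch of that worst case. The sizes <code>m</code> and <code>n</code> and the all-ones column are hypothetical values picked for illustration. The input stores only <code>m</code> non-zero entries, yet the product has all <code>m * m</code> entries filled.</p>

```python
import numpy as np
from scipy import sparse

m, n = 200, 1000  # hypothetical sizes, chosen only for illustration

# Worst case from the note: only column 0 of A holds non-zero values.
rows = np.arange(m)
cols = np.zeros(m, dtype=int)
A = sparse.csr_matrix((np.ones(m), (rows, cols)), shape=(m, n))

# The same product that sparse_corrcoef forms before centering.
G = A.dot(A.T)

input_nnz = A.nnz  # stored entries in the input
gram_nnz = G.nnz   # stored entries in the (m, m) product
print(input_nnz, gram_nnz, m * m)
```

<p>Comparing <code>A.nnz</code> with <code>G.nnz</code> on your own data, or on a sample of its rows, is one cheap way to estimate the cost before running the full computation.</p>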