<p>Same problem here. I have a large, non-sparse matrix. It fits in memory just fine, but <code>cosine_similarity</code> crashes for whatever unknown reason, probably because it copies the matrix one time too many somewhere. So I made it compare small batches of rows "on the left" instead of the entire matrix:</p>
<pre><code>import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def cosine_similarity_n_space(m1, m2, batch_size=100):
assert m1.shape[1] == m2.shape[1]
ret = np.ndarray((m1.shape[0], m2.shape[0]))
for row_i in range(0, int(m1.shape[0] / batch_size) + 1):
start = row_i * batch_size
end = min([(row_i + 1) * batch_size, m1.shape[0]])
if end <= start:
break # cause I'm too lazy to elegantly handle edge cases
rows = m1[start: end]
sim = cosine_similarity(rows, m2) # rows is O(1) size
ret[start: end] = sim
return ret
</code></pre>
<p>No crashing for me; YMMV. Try different batch sizes to make it faster. I used to compare only one row at a time, and it took about 30x as long on my machine.</p>
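<p>If you want to experiment with <code>batch_size</code>, a rough timing sketch along these lines should do; the matrix shapes and batch sizes below are made-up placeholders, so adjust them to match your data:</p>
<pre><code>import time

import numpy as np

# Hypothetical matrices just for illustration; substitute your real data.
m1 = np.random.rand(5_000, 300)
m2 = np.random.rand(2_000, 300)

for batch_size in (10, 100, 1000):
    t0 = time.perf_counter()
    cosine_similarity_n_space(m1, m2, batch_size=batch_size)
    print(f"batch_size={batch_size}: {time.perf_counter() - t0:.2f}s")
</code></pre>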
<p>Silly but effective sanity check:</p>
<pre><code>import random
while True:
m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
n = np.random.rand(random.randint(1, 100), m.shape[1])
assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))
</code></pre>
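<p>If you prefer a check that actually terminates (say, to drop into a test suite), a bounded, seeded variant of the same idea could look like this; the iteration count and seeds are arbitrary:</p>
<pre><code>import random

import numpy as np

random.seed(0)
np.random.seed(0)

# Same comparison as above, but only a fixed number of randomized rounds.
for _ in range(100):
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))
</code></pre>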