<p>You can still use the <a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise" rel="nofollow noreferrer">sklearn.metrics.pairwise</a> methods with sparse matrices/arrays:</p>
<pre><code># I've executed your example up to (including):
# ...
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])
B = clf.transform(df['b'])
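# import only the functions used below instead of a wildcard import
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances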
</code></pre>
<p><code>paired_cosine_distances</code> shows how far apart, or how different, the strings are (it compares the values in the two columns row by row).</p>
<p><code>0</code> means a full match.</p>
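<p>A minimal, self-contained sketch of the row-by-row comparison. The two lists are made-up stand-ins for columns <code>a</code> and <code>b</code>, and <code>clf</code> is assumed here to be a <code>TfidfVectorizer</code>; neither is taken from the question.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Hypothetical data standing in for df['a'] and df['b']
a = ["red apple", "green pear", "blue plum"]
b = ["red apple", "green apple", "yellow lemon"]

# Assumption: clf is a TfidfVectorizer fitted on both columns combined
clf = TfidfVectorizer()
clf.fit([x + " " + y for x, y in zip(a, b)])

A = clf.transform(a)  # sparse matrix, one row per string in a
B = clf.transform(b)  # sparse matrix, one row per string in b

# One distance per row: dist[i] compares a[i] with b[i] only
dist = paired_cosine_distances(A, B)
print(dist)
```

<p>With this data the first pair is identical, so its distance is 0; the last pair shares no tokens, so its distance is 1; the middle pair shares one token and falls in between.</p>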
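<p><code>cosine_similarity</code> compares the first string of column <code>a</code> with all strings in column <code>b</code> (<strong>row 1</strong>), the second string of column <code>a</code> with all strings in column <code>b</code> (<strong>row 2</strong>), and so on.</p>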
<pre><code>In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.74162106,  0.        ],
       [ 0.43929881,  0.        ,  0.72562753,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])
In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
</code></pre>
<p>Note: all of the calculations were done on <strong>sparse</strong> matrices; we never expanded them into dense arrays in memory!</p>
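<p>One detail worth checking yourself: the inputs stay sparse, but <code>cosine_similarity</code> returns a dense array by default. Passing <code>dense_output=False</code> keeps the result sparse when both inputs are sparse. A small sketch, again with made-up data and an assumed <code>TfidfVectorizer</code>:</p>

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data standing in for df['a'] and df['b']
a = ["red apple", "green pear", "blue plum"]
b = ["red apple", "green apple", "yellow lemon"]

# Assumption: clf is a TfidfVectorizer fitted on both columns combined
clf = TfidfVectorizer().fit([x + " " + y for x, y in zip(a, b)])
A = clf.transform(a)
B = clf.transform(b)

# Default: sparse inputs, dense (len(a) x len(b)) result
sim_dense = cosine_similarity(A, B)

# dense_output=False: the result stays a scipy sparse matrix
sim_sparse = cosine_similarity(A, B, dense_output=False)

print(sparse.issparse(A), sparse.issparse(B))       # inputs are sparse
print(type(sim_dense).__name__, sim_dense.shape)    # dense result
print(sparse.issparse(sim_sparse))                  # sparse result
print(np.allclose(sim_sparse.toarray(), sim_dense)) # same values either way
```

<p>Entry <code>[i, j]</code> of either result compares <code>a[i]</code> with <code>b[j]</code>. The sparse form mostly pays off when the two columns share few tokens, since most similarities are then exactly zero.</p>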