<p>I suggest using the sparse matrices from <code>scipy.sparse</code>:</p>
<pre><code>from scipy.sparse import coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

tfidf_csv = """Doc, Term, TFIDF score
1, apples, 0.3
1, bananas, 0.7
2, apples, 0.1
2, pears, 0.9
3, apples, 0.6
3, bananas, 0.2
3, pears, 0.2"""

voc = {}  # maps each term to a column index
# Sparse matrix in coordinate format: the entry at
# (rows[i], cols[i]) holds the value data[i].
rows, cols, data = [], [], []
for line in tfidf_csv.split("\n")[1:]:  # skip the header
    doc, term, tfidf = line.replace(" ", "").split(",")
    # Doc IDs are used directly as row indices; they start at 1,
    # so row 0 of the matrix stays empty.
    rows.append(int(doc))
    # map each vocabulary item to an int
    if term not in voc:
        voc[term] = len(voc)
    cols.append(voc[term])
    data.append(float(tfidf))
doc_term_matrix = coo_matrix((data, (rows, cols)))
# compressed sparse row matrix (a sparse format with fast row slicing)
sparse_row_matrix = doc_term_matrix.tocsr()
print("Sparse matrix")
print(sparse_row_matrix.toarray())  # convert to a dense array
# compute the similarity between each pair of documents
similarities = cosine_similarity(sparse_row_matrix)
print("Similarity matrix")
print(similarities)
</code></pre>
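<p>Running the script prints the dense document-term matrix, followed by the matrix of pairwise cosine similarities between the documents.</p>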
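<p>To make clear what the similarity score measures, here is a minimal pure-Python sketch of cosine similarity on sparse TF-IDF vectors. It assumes each document is stored as a dict mapping terms to weights; the function name <code>cosine_similarity_sparse</code> and the <code>docs</code> layout are hypothetical and only for illustration, not part of scikit-learn.</p>

```python
import math


def cosine_similarity_sparse(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts.

    Hypothetical helper for illustration only.
    """
    # Iterate over the smaller dict so the dot product touches fewer keys.
    if len(a) > len(b):
        a, b = b, a
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # Assumption: an all-zero (empty) vector is treated as similarity 0.0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same TF-IDF scores as in the example data above.
docs = {
    1: {"apples": 0.3, "bananas": 0.7},
    2: {"apples": 0.1, "pears": 0.9},
    3: {"apples": 0.6, "bananas": 0.2, "pears": 0.2},
}

for i in docs:
    for j in docs:
        if i < j:
            score = cosine_similarity_sparse(docs[i], docs[j])
            print(f"doc {i} vs doc {j}: {score:.4f}")
```

<p>This is fine for a handful of documents; for a large corpus the matrix-based approach is likely to be much faster, since it computes all pairs at once.</p>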