<p>You should take a look at <a href="https://radimrehurek.com/gensim/tutorial.html" rel="nofollow">gensim</a>. Some sample starting code looks like this:</p>
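<pre><code>from gensim import corpora, models, similarities
# Read and tokenize the corpus once: one document per line, lowercased, split on whitespace
with open('corpus.txt') as f:
    texts = [line.lower().split() for line in f]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
# num_features must match the vocabulary size, so derive it from the dictionary
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
</code></pre>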
<p>At prediction time, first get the vector for the new document:</p>
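<pre><code>doc = "Human computer interaction"  # example query text; substitute your own new document
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]  # convert the bag-of-words vector to TF-IDF space
</code></pre>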
<p>Then get the similarities (sorted by most similar):</p>
<pre><code>sims = index[vec_tfidf]  # perform a similarity query against the corpus
# sort the (document_number, document_similarity) 2-tuples, most similar first
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
</code></pre>
<p>This does a linear scan, as you wanted to do, but gensim has a more optimized implementation of it.</p>
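<p>To make the "linear scan" concrete, here is a minimal pure-Python sketch of the same idea: weight every document with TF-IDF, then compare the query against each document in turn using cosine similarity. It is not gensim's implementation. The function names (<code>fit_tfidf</code>, <code>rank_by_similarity</code>), the sample documents, and the weighting scheme (raw term count times log2(N/df), L2-normalized) are all assumptions chosen for illustration.</p>

```python
import math
from collections import Counter


def _normalize(vec):
    """Scale a {term: weight} dict to unit L2 length (empty dict if all-zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_tfidf(docs):
    """Return (idf, vectors) for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log2(n / count) for term, count in df.items()}
    vectors = [
        _normalize({t: c * idf[t] for t, c in Counter(doc).items()})
        for doc in docs
    ]
    return idf, vectors


def rank_by_similarity(query_tokens, docs):
    """Linear scan: score the query against every document, most similar first."""
    idf, vectors = fit_tfidf(docs)
    # Terms never seen in the corpus carry no weight and are dropped
    counts = Counter(t for t in query_tokens if t in idf)
    query = _normalize({t: c * idf[t] for t, c in counts.items()})
    scores = []
    for doc_id, vec in enumerate(vectors):
        # Both vectors are unit length, so the dot product is the cosine similarity
        score = sum((w * vec.get(t, 0.0) for t, w in query.items()), 0.0)
        scores.append((doc_id, score))
    return sorted(scores, key=lambda item: -item[1])


if __name__ == "__main__":
    sample_docs = [
        "human computer interaction".split(),
        "graph of trees".split(),
        "computer system response time".split(),
    ]
    print(rank_by_similarity("human computer".split(), sample_docs))
```

<p>The cost of each query grows with the number of documents, which is the trade-off a linear scan makes. A library such as gensim keeps the same overall shape but works on sparse matrices instead of Python dicts.</p>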