Gensim Doc2Vec model: how do I compute similarity for a corpus using vectors obtained from a pre-trained Doc2Vec model?

import multiprocessing

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# train_data is an iterable of tokenized documents (lists of words)
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_data)]
assert gensim.models.doc2vec.FAST_VERSION > -1
cores = multiprocessing.cpu_count()
doc2vec_model = Doc2Vec(vector_size=200, window=5, workers=cores)
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=30)

1 answer

User

Answer #1 · Posted on 2024-04-19 02:16:33

A month ago I used a dirty workaround for this with gensim==3.2.0 (the syntax may have changed since then).

You can save the inferred vectors in the KeyedVectors (word2vec text) format.

from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec

# `questions` is assumed to be a list of inferred vectors (NumPy arrays)
vectors = dict()
# y_names = doc2vec_model.docvecs.doctags.keys()
y_names = range(len(questions))

for name in y_names:
    # vectors[name] = doc2vec_model.docvecs[name]
    vectors[str(name)] = questions[name]

# word2vec text format: a "<count> <dimension>" header line,
# then one "<key> <v1> <v2> ..." line per vector
with open("question_vectors.txt", "w") as f:
    f.write("{} {}\n".format(len(questions), doc2vec_model.vector_size))
    for v in vectors:
        line = "{} {}\n".format(v, " ".join(vectors[v].astype(str)))
        f.write(line)

Then you can load the file and use the most_similar function.


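Another solution (especially if the number of questions is not that large) is simply to convert the questions to an np.array and compute the cosine similarity yourself, for example: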

import numpy as np

questions = np.array(questions)
texts_norm = np.linalg.norm(questions, axis=1)[np.newaxis].T
norm = texts_norm * texts_norm.T

product = np.matmul(questions, questions.T)
product = product.T / norm

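# Zero out the diagonal; otherwise each item is the closest to itself
for j in range(len(questions)):
    product[j, j] = 0

# Indices of the 10 items most similar to the 0th question (in no particular order).
# argpartition moves the smallest values to the front, so negate the similarities.
# Requires more than 10 questions.
np.argpartition(-product[0], 10)[:10]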
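
The same idea can be wrapped in a small helper that returns the neighbours sorted by similarity. This is only a sketch: top_k_similar is a hypothetical name, the toy vectors are made up for illustration, and it assumes every row is a non-zero vector.

```python
import numpy as np


def top_k_similar(vectors, query_index, k):
    """Return indices of the k rows most cosine-similar to row `query_index`."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise rows to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    # Exclude the query itself from the candidates.
    sims[query_index] = -np.inf
    k = min(k, len(sims) - 1)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition leaves the k largest values (unordered) in the last k slots.
    top = np.argpartition(sims, -k)[-k:]
    # Order those k candidates by descending similarity.
    return top[np.argsort(sims[top])[::-1]]


# Hypothetical toy vectors standing in for inferred question vectors.
questions = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(top_k_similar(questions, 0, 2))
```

Using -inf for the query's own score, rather than 0, keeps it out of the results even when the other similarities are negative.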
