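I am trying to return a list of tuples that ranks the candidates by how similar they are to a question, most similar first. Each tuple holds the candidate's index in the original candidate list and the candidate itself. I implemented this function: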
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(question, candidates, embeddings, dim=300):
    """
        question: a string
        candidates: a list of strings (candidates) which we want to rank
        embeddings: some embeddings
        dim: dimension of the current embeddings
        result: a list of pairs (initial position in the list, candidate)
    """
    cosi_dic = {}
    most_candidates = []
    q_vec = question_to_vec(question, embeddings, dim)
    for i in candidates:
        can_vec = question_to_vec(i, embeddings, dim)
        # The similarity score is used as the dictionary key
        cosi_dic[cosine_similarity(can_vec.reshape(1, -1), q_vec.reshape(1, -1))[0][0]] = i
    # Walk the scores from highest to lowest
    for i in sorted(cosi_dic.keys(), reverse=True):
        most_candidates.append((candidates.index(cosi_dic[i]), cosi_dic[i]))
    return most_candidates
The function question_to_vec computes the mean of the embedding vectors of all the words in a sentence. Here it is:
import numpy as np

def question_to_vec(question, embeddings, dim=300):
    """
        question: a string
        embeddings: dict where the key is a word and a value is its embedding
        dim: size of the representation
        result: vector representation for the question
    """
    v = np.zeros(dim)
    all_vectors = []
    question = question.split()
    for i in question:
        # Words missing from the embeddings are skipped
        if i in embeddings:
            all_vectors.append(embeddings[i])
    if all_vectors:
        v = np.mean(all_vectors, axis=0)
    # If no word was found, v is still the zero vector
    return v
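The expected output should look like [(2, c), (0, b), (1, a)] if c, at index 2 of the input candidate list, is the most similar and a is the least similar. However, when I try to run this code: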
wv_ranking = []
for i in range(len(validation)):
    line = validation[i]
    q, *ex = line
    ranks = rank_candidates(q, ex, wv_embeddings)
    wv_ranking.append([r[0] for r in ranks].index(0) + 1)
where wv_embeddings is the GoogleNews-vectors-negative300 embedding, I get the error: ValueError: 0 is not in list
I checked the cosine similarities for the validation line that raises the exception and found that every value was zero.
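A small reproduction may make the symptom easier to see. It is only a sketch: the candidate names and the all-zero scores are made up for illustration, and it mimics just the dictionary-building part of rank_candidates, not the embedding lookup.

```python
# Hypothetical data: three candidates that all received the same score
candidates = ["a", "b", "c"]
scores = [0.0, 0.0, 0.0]

cosi_dic = {}
for cand, score in zip(candidates, scores):
    # The key is identical every time, so each write replaces the previous one
    cosi_dic[score] = cand

# Only the last candidate is left in the dictionary
ranks = [(candidates.index(c), c) for c in cosi_dic.values()]
positions = [r[0] for r in ranks]

print(ranks)           # [(2, 'c')]
print(0 in positions)  # False
```

With only index 2 left in the result, calling .index(0) on the list of positions raises ValueError: 0 is not in list.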
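After digging into the error, I found that using a dictionary keyed by the similarity score inside the function overwrites entries that have the same cosine similarity value. The function therefore must not use the similarity as a dictionary key.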
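One possible way to write it, as a sketch rather than the definitive fix: keep (score, index, candidate) tuples in a list and sort them, so equal scores never collide. To keep the example self-contained, the small cosine helper below is an assumption swapped in for sklearn's cosine_similarity (it returns 0.0 when either vector is all zeros), and question_to_vec is a compact version of the function from the question. The toy two-dimensional embeddings are invented for illustration.

```python
import numpy as np

def question_to_vec(question, embeddings, dim=300):
    """Mean of the embeddings of the known words; zeros if none are known."""
    vectors = [embeddings[w] for w in question.split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(u, v):
    """Illustrative helper: cosine similarity, 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def rank_candidates(question, candidates, embeddings, dim=300):
    """Return (original index, candidate) pairs, most similar first."""
    q_vec = question_to_vec(question, embeddings, dim)
    # enumerate() records the original position, so duplicate strings are safe too
    scored = [
        (cosine(question_to_vec(c, embeddings, dim), q_vec), i, c)
        for i, c in enumerate(candidates)
    ]
    # Sort on the score only; the sort is stable, so ties keep their input order
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, c) for _, i, c in scored]

# Toy embeddings, made up for this example
toy_embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.8, 0.6]),
    "car": np.array([0.0, 1.0]),
}
print(rank_candidates("cat", ["car", "dog", "cat", "xyz"], toy_embeddings, dim=2))
```

Because every candidate now appears in the result exactly once, [r[0] for r in ranks].index(0) always finds index 0, even when all similarities are zero.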