How to compute cosine similarity over 1000 random examples with BERT

I'm trying to use bert-base-uncased to compute the cosine similarity between 1000 questions and 1000 answers, then find the 5 most similar answers for a question and compute top-1 and top-5 accuracy against the expected answer. But the accuracy I get is always 0.0, and the retrieved answers don't seem similar to the question at all.

import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity  # assuming sklearn's cosine_similarity

# 1000 questions and 1000 answers, sampled independently from the training set
sample_1000_quest = train_ds['questions'].sample(1000)
sample_1000_answer = train_ds['answers'].sample(1000)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
# move the model to the selected device so it matches the inputs below
model_bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).to(device).eval()


# pick one question from the sample
selected_question = sample_1000_quest.iloc[1]

# positional index of the row labelled 30574 within the sample
selected_question_idx = sample_1000_quest.index.get_loc(30574)


encoded_question = tokenizer_bert(selected_question, return_tensors='pt', padding=True, truncation=True).to(device)


with torch.no_grad():
    outputs = model_bert(**encoded_question)
    # mean-pool the token embeddings into one sentence vector (moved to CPU for sklearn)
    question_embedding = outputs.last_hidden_state.mean(dim=1).cpu()


encoded_answers = []
answer_embeddings = []
for answer in sample_1000_answer:
    encoded_answer = tokenizer_bert(answer, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model_bert(**encoded_answer.to(device))
        answer_embedding = outputs.last_hidden_state.mean(dim=1).cpu()
        answer_embeddings.append(answer_embedding)


similarities = []
for answer_embedding in answer_embeddings:
    similarity = cosine_similarity(question_embedding, answer_embedding)
    similarities.append(similarity.item())


# positions of the 5 most similar answers, highest similarity first
most_similar_indices = np.argsort(similarities)[-5:][::-1]


ground_truth_idx = train_ds['answers'].iloc[selected_question_idx]


top1_accuracies = []
top5_accuracies = []

top1_idx = most_similar_indices[0]
top1_accuracy = 1 if top1_idx == ground_truth_idx else 0
top5_accuracy = 1 if ground_truth_idx in most_similar_indices else 0

top1_accuracies.append(top1_accuracy)
top5_accuracies.append(top5_accuracy)

print("Selected Question:", selected_question)
print("Most similar 5 asnwer:")
for i, idx in enumerate(most_similar_indices):
    print(f"{i+1}. {sample_1000_answer.iloc[idx]}")

print("Top-1 Accuracy:", top1_accuracy)
print("Top-5 Accuracy:", top5_accuracy)

Output:

Selected Question:  bir sunum oluşturmak için beş adım yazın.
Most similar 5 answers:
1.  doğum günü gülüm bütün yaz aldığım en güzel hediyeydi.
2.  bu deneyin amacı ilkeleri anlamaktır.
3.  bir satış elemanı sunum yapıyor.
4.  hangi konuda yardıma ihtiyacın olduğunu söyle.
5.  konuşmanın içeriği, projede bir sonraki adım için onay almakla ilgilidir.
Top-1 Accuracy: 0
Top-5 Accuracy: 0

1 Answer


bert-base-uncased is a model pretrained mainly on English text. If you want to work with another language, choose a model pretrained for that language instead, for example dbmdz/bert-base-turkish-cased for Turkish.
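
As a minimal sketch, swapping in the Turkish checkpoint might look like this, assuming the Hugging Face transformers AutoTokenizer/AutoModel classes and the same mean-pooling strategy as in the question (the embed helper is just an illustrative name):

import torch
from transformers import AutoTokenizer, AutoModel

# Turkish BERT checkpoint instead of the English-only bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-turkish-cased')
model = AutoModel.from_pretrained('dbmdz/bert-base-turkish-cased').eval()

def embed(texts):
    # tokenize a list of sentences and mean-pool the last hidden state, as in the question
    enc = tokenizer(list(texts), return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state.mean(dim=1)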

Also, your dataset is fairly large, but you only count a hit when the correct answer lands in the top 5, which is quite a strict criterion: with a hard right-or-wrong (0 or 1) check like that, almost every query will score 0. Instead of only judging right versus wrong, consider a metric that measures how close the predicted answer is to the expected one, or relax the cutoff to something like top-10 or top-25 accuracy and see whether the scores improve.
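
For example, here is a rough sketch of a configurable top-k evaluation, assuming q_embs and a_embs are aligned embedding matrices (the i-th question is paired with the i-th answer), e.g. produced by the embed helper above; these variable names are only for illustration:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_accuracy(q_embs, a_embs, k=10):
    # q_embs[i] is assumed to be paired with a_embs[i]
    sims = cosine_similarity(q_embs, a_embs)      # shape: (n_questions, n_answers)
    top_k = np.argsort(-sims, axis=1)[:, :k]      # k highest-scoring answer indices per question
    hits = [i in top_k[i] for i in range(sims.shape[0])]
    return float(np.mean(hits))

# compare a strict and a relaxed cutoff:
# print(top_k_accuracy(q_embs, a_embs, k=5), top_k_accuracy(q_embs, a_embs, k=25))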
