Sentence similarity with Universal Sentence Encoder using a threshold

Posted 2024-04-28 17:01:33


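I have a dataset with more than 1,500 rows, one sentence per row. I am trying to find the best way to pick out the sentences that are most similar to each other. I tried this example, but it is too slow: it takes about 20 minutes to process 1,500 rows.

I used the code from my previous question and tried several variations to speed it up, with little effect. I then came across the Universal Sentence Encoder for TensorFlow, which seems to be both fast and accurate. I am working in Colab; you can check it here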

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5", "https://tfhub.dev/google/universal-sentence-encoder-lite/2"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

df = pd.DataFrame(columns=["ID","DESCRIPTION"], data=np.matrix([[10,"Cancel ASN WMS Cancel ASN"],
                                                                [11,"MAXPREDO Validation is corect"],
                                                                [12,"Move to QC"],
                                                                [13,"Cancel ASN WMS Cancel ASN"],
                                                                [14,"MAXPREDO Validation is right"],
                                                                [15,"Verify files are sent every hours for this interface from Optima"],
                                                                [16,"MAXPREDO Validation are correct"],
                                                                [17,"Move to QC"],
                                                                [18,"Verify files are not sent"]
                                                                ]))

# The sentences to embed: one per row of the DESCRIPTION column
messages = list(df['DESCRIPTION'])
message_embeddings = embed(messages)

for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
  print("Message: {}".format(messages[i]))
  print("Embedding size: {}".format(len(message_embedding)))
  message_embedding_snippet = ", ".join(
      (str(x) for x in message_embedding[:3]))
  print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

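What I am looking for

I want an approach where I can pass a threshold, for example 0.90, and get back all rows whose sentences have a similarity to each other above 0.90.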

Data Sample
ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN   
11    | MAXPREDO Validation is corect
12    | Move to QC  
13    | Cancel ASN WMS Cancel ASN   
14    | MAXPREDO Validation is right
15    | Verify files are sent every hours for this interface from Optima
16    | MAXPREDO Validation are correct
17    | Move to QC  
18    | Verify files are not sent 

Expected result

Rows from the data above whose similarity to each other is above 0.90 should be returned together with their ID

ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN
13    | Cancel ASN WMS Cancel ASN
11    | MAXPREDO Validation is corect  # even spelling is not correct
14    | MAXPREDO Validation is right
16    | MAXPREDO Validation are correct
12    | Move to QC  
17    | Move to QC 

1 answer

Answer 1 · Posted 2024-04-28 17:01:33

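There are several ways to measure the similarity between two embedding vectors. The most common one is cosine_similarity.

So the first step is to compute the similarity matrix:

Code: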

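from sklearn.metrics.pairwise import cosine_similarity

# Embed every description, then compare each embedding with every other one
message_embeddings = embed(list(df['DESCRIPTION']))
cos_sim = cosine_similarity(message_embeddings)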
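
To make clear what that call computes, here is a minimal NumPy sketch of pairwise cosine similarity. The helper name cosine_similarity_matrix and the toy 2-D vectors are assumptions made for illustration; they stand in for the real sentence embeddings and are not part of scikit-learn or of the original code.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = x / np.clip(norms, 1e-12, None)
    # Dot products of unit-length rows are the cosines of the angles between them.
    return unit @ unit.T

# Toy vectors (hypothetical data) standing in for sentence embeddings.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
```

Rows 0 and 1 point in the same direction, so their similarity is 1 regardless of length, while row 2 is orthogonal to both and scores 0.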

This gives a 9×9 matrix of similarity values. You can plot a heatmap of this matrix to visualize it.

Code:

def plot_similarity(labels, corr_matrix):
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr_matrix,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=90)
  g.set_title("Semantic Textual Similarity")

plot_similarity(list(df['DESCRIPTION']), cos_sim)

Output:

[Image: heatmap of the similarity matrix]

Darker cells indicate higher similarity.

Finally, iterate over the cos_sim matrix and use the threshold to collect all similar sentences:

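threshold = 0.8
row_index = []
for i in range(cos_sim.shape[0]):
  if i in row_index:
    continue
  # Columns whose similarity to row i exceeds the threshold (this always includes i itself)
  similar = [index for index in range(cos_sim.shape[1]) if cos_sim[i][index] > threshold]
  if len(similar) > 1:
    # Skip indices that were already collected so no row is listed twice
    row_index += [index for index in similar if index not in row_index]

# Select the matching rows by position; similar sentences stay next to each other
sim_df = df.iloc[row_index][['ID', 'DESCRIPTION']].reset_index(drop=True)
sim_df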

The resulting data frame looks like this.
Output:

[Image: the resulting sim_df data frame]
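
If you would rather get the similar sentences back as explicit groups instead of one flat list of rows, the thresholding step can be sketched as follows. This is only one possible approach, not the method from the answer: the function similar_groups and the matrix toy_sim are hypothetical, and the sketch assumes that sentences linked through a chain of above-threshold pairs should end up in the same group.

```python
import numpy as np

def similar_groups(sim, threshold):
    """Group row indices whose similarity exceeds `threshold`, directly or through a chain."""
    sim = np.asarray(sim)
    adjacent = sim > threshold
    seen = set()
    groups = []
    for start in range(sim.shape[0]):
        if start in seen:
            continue
        # Depth-first walk over rows linked by above-threshold similarity.
        stack, group = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            group.append(i)
            for j in np.flatnonzero(adjacent[i]):
                j = int(j)
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        # Keep only groups that contain at least two sentences.
        if len(group) > 1:
            groups.append(sorted(group))
    return groups

# Hypothetical similarity matrix for four sentences (not real model output).
toy_sim = np.array([
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.15, 0.25],
    [0.10, 0.15, 1.00, 0.30],
    [0.20, 0.25, 0.30, 1.00],
])
groups = similar_groups(toy_sim, threshold=0.9)
```

With the real data you would pass cos_sim in place of toy_sim and then look up each group with df.iloc[group]. Lowering the threshold merges more sentences into the same group, so it is worth trying a few values on a sample first.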

There are other ways to build the similarity matrix. You can check this for more methods.
