Constructing a DataFrame from a list of pairwise Word Mover's Distance scores

Posted 2024-04-25 17:24:40


I want to run PCA on a list of pairwise sentence distances (Word Mover's Distance) that I've computed. So far I have a similarity score for each pair of sentences, and all of the pairwise scores are stored in a flat list. My main question is:

How do I construct a matrix that contains these similarity scores together with the indices of the original sentences? At the moment the list only holds the score for each pair, and I haven't found a way to map the scores back to the sentences themselves.

My ideal dataframe would look like this:

             Sentence1  Sentence2  Sentence3
  Sentence1  1          0.5        0.8
  Sentence2  0.5        1          0.4
  Sentence3  0.8        0.4        1

However, my list of similarity scores looks like this, with no indices:

[0.5, 0.8, 0.4]

How can I turn this into a dataframe that I can run PCA on? Thanks.

---- My steps for constructing the pairwise similarity scores

from itertools import combinations

# Tokenize all sentences in a column
tokenized_sentences = [s.split() for s in df[col]]

# Calculate the distance between two responses using WMD
def find_similar_docs(sentence_1, sentence_2):
    distance = model.wv.wmdistance(sentence_1, sentence_2)
    return distance

# Build all response pairs
pairs_sentences = list(combinations(tokenized_sentences, 2))

# Get all similarity scores between sentence pairs
list_of_sim = []
for sent_pair in pairs_sentences:
    sim_curr_pair = find_similar_docs(sent_pair[0], sent_pair[1])
    list_of_sim.append(sim_curr_pair)

It would be a lot easier if the index were just "1" instead of the tokenized sentence (["I", "open", "communication", "culture"]), so I'm a bit stuck here.
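What I have in mind is something like the sketch below: pair up integer indices using the same `combinations` ordering that produced `list_of_sim`, then fill a symmetric matrix (the names `index_pairs` and `dist_matrix` are just placeholders, not part of my actual code):

import numpy as np

# Pair up integer indices instead of the tokenized sentences themselves,
# in the same order that combinations() produced the sentence pairs above
n = len(tokenized_sentences)
index_pairs = list(combinations(range(n), 2))

# Fill a symmetric matrix; the diagonal stays 0 because the WMD of a
# sentence to itself is 0
dist_matrix = np.zeros((n, n))
for (i, j), score in zip(index_pairs, list_of_sim):
    dist_matrix[i, j] = score
    dist_matrix[j, i] = score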


1 Answer
#1 · Posted 2024-04-25 17:24:40

Use numpy to build the distance matrix, then convert it to a pandas dataframe:

import numpy as np
import pandas as pd

# calculate distance between 2 responses using wmd
def find_similar_docs(sentence_1, sentence_2):
    distance = model.wv.wmdistance(sentence_1, sentence_2)
    return distance
  
# create distance matrix
tokenized_sentences = [s.split() for s in df[col]]
l = len(tokenized_sentences)
distances = np.zeros((l, l))
for i in range(l):
    for j in range(l):
        distances[i, j] = find_similar_docs(tokenized_sentences[i], tokenized_sentences[j])

# make pandas dataframe
labels = ['sentence' + str(i + 1) for i in range(l)]
df = pd.DataFrame(data=distances, index=labels, columns=labels)
print(df)
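Since the end goal is PCA, one way to continue (a sketch assuming scikit-learn is installed; `n_components=2` is an arbitrary choice) would be:

from sklearn.decomposition import PCA

# Reduce the distance matrix to two components; note the diagonal is 0,
# not 1, because WMD is a distance rather than a similarity
pca = PCA(n_components=2)
coords = pca.fit_transform(df.to_numpy())
print(pca.explained_variance_ratio_)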
