How do I plot an MDS from a similarity matrix?

import pandas
from sklearn import manifold
import matplotlib.pyplot as plt

# Read the semicolon-separated matrix; the first row and first column hold the labels
data = pandas.read_table("file.csv", sep=";", header=0, index_col=0)

# The matrix is passed to MDS as precomputed dissimilarities
mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
points = mds.embedding_

# Prepare axes
ax = plt.axes([0, 0, 2, 2])
ax.set_aspect(aspect='equal')

# Plot points
plt.scatter(points[:, 0], points[:, 1], color='silver', s=150)

# Add labels
for i in range(data.shape[0]):
    ax.annotate(data.index[i], (points[i, 0], points[i, 1]), color='blue')

#plt.show()  # Open display and show at screen
plt.savefig('out.png', format='png', bbox_inches='tight')  # PNG
#plt.savefig('out.jpg', format='jpg', bbox_inches='tight')  # JPG

1 Answer

User

#1 · Posted on 2024-06-06 16:24:13

I compared my results against XLSTAT (an Excel extension) so that I could try many scenarios and work out how each step should be done.

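First: my input matrix is a "similarity" matrix, because I can read it as "a and a are 100% equal". Since MDS takes a dissimilarity matrix as input, I have to apply a transformation.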

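In the literature (Ricco Rakotomalala's French course on data science, p. 208-209), the simple method is to subtract each cell from the maximum value (here, a "1 - cell" operation). So you can easily write a Python program or, since I like to keep track of each matrix, an AWK preprocessing script: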

similarity-to-dissimilarity-simple.awk

# We keep the tags around the CSV matrix
# X ; Word1 ; Word2 ; ...
# Header
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");

    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
        col = $i;
        printf("%s%s", OFS, col);
    }

    # End of line
    printf("\n");
}

# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word/tag
    col = $1;
    printf("%s", col);

    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
        # dissimilarity = (1 - similarity)
        NUM = $i;
        VAL = 1 - NUM;
        printf("%s%s", OFS, VAL);
    }

    printf("\n");
}

It can be invoked with the following command:

awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
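For anyone who prefers Python, here is a minimal sketch of the same "1 - cell" conversion using only the standard library. The function name `similarity_to_dissimilarity_simple` is my own (hypothetical), and it assumes the same layout as the AWK script: a header row of labels, then one label followed by numeric similarities on each line, separated by semicolons.

```python
import csv
import io


def similarity_to_dissimilarity_simple(csv_text, delimiter=";"):
    """Return CSV text where every numeric cell s is replaced by 1 - s."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row_number, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        if row_number == 0:
            # Header row: keep the labels unchanged
            writer.writerow(row)
        else:
            # First cell is the label, the remaining cells are similarities
            label, cells = row[0], row[1:]
            writer.writerow([label] + [1 - float(c) for c in cells])
    return out.getvalue()


if __name__ == "__main__":
    sample = "X;a;b\na;1;0.5\nb;0.5;1\n"
    print(similarity_to_dissimilarity_simple(sample), end="")
```

If the matrix is already loaded as a pandas DataFrame, as in the question, `1 - data` should give the same result in one expression.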

A more elaborate method (sorry, I can't find the reference :( ) applies a different transformation to each cell:

$\sqrt{s_{ii} + s_{i'i'} - 2 s_{ii'}}$

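This method seems well suited when the diagonal does not contain the same value everywhere (I saw a co-occurrence matrix there... it should apply to his CA). In my case, since the diagonal is always 1, I reduced it to: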

$\sqrt{2 - 2 s_{ii'}}$

So the AWK program for this transformation (given my data, I implemented the simplified form) is:

similarity-to-dissimilarity-complex.awk

# Header
# X ; Word1 ; Word2 ; ...
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");

    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
        col = $i;
        printf("%s%s", OFS, col);
    }

    # End of line
    printf("\n");
}

# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word
    col = $1;
    printf("%s", col);

    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
        # dissimilarity = (2 - 2 * similarity)^(1/2)
        NUM = $i;
        VAL = sqrt(2 - 2 * NUM);
        printf("%s%s", OFS, VAL);
    }

    printf("\n");
}

You can invoke it with the following command:

awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv
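The AWK script above only covers the simplified case where the diagonal is all 1s. As a rough sketch, the general form (with the diagonal terms kept) could look like this in Python with NumPy. The function name is hypothetical, and clipping negative values to zero before the square root is my own assumption, not part of the formula as I found it.

```python
import numpy as np


def similarity_to_dissimilarity_general(similarity):
    """Apply sqrt(s_ii + s_i'i' - 2 * s_ii') to every cell of a square matrix."""
    s = np.asarray(similarity, dtype=float)
    diag = np.diag(s)
    # Broadcasting builds s_ii + s_i'i' for every pair (i, i')
    squared = diag[:, None] + diag[None, :] - 2.0 * s
    # Assumption: negative values (rounding, or an inconsistent matrix)
    # are clipped to 0 so that sqrt does not return NaN
    squared = np.clip(squared, 0.0, None)
    return np.sqrt(squared)


if __name__ == "__main__":
    s = [[1.0, 0.5], [0.5, 1.0]]
    print(similarity_to_dissimilarity_general(s))
```

With a diagonal of 1s this gives the same values as the simplified `sqrt(2 - 2 * NUM)` in the AWK script.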

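When I used Kruskal's stress to check which version is better, the simple similarity-to-dissimilarity conversion (1 - cell) came out best in my case: its stress stays between 0.32 and 0.34, which is not good, whereas the complex version gives values above 0.34, which is worse.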
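For reference, here is a minimal sketch of how Kruskal's stress-1 can be computed by hand from the dissimilarity matrix and the embedded points. The function name `kruskal_stress1` is hypothetical. I assume metric MDS (so the target values are the input dissimilarities themselves) and I normalize by the embedded distances; other conventions for the denominator exist. As far as I know, the `stress_` attribute of scikit-learn's MDS is a raw, unnormalized value by default for metric MDS, so it is not directly comparable to these numbers.

```python
import numpy as np


def kruskal_stress1(dissimilarities, points):
    """Kruskal's stress-1 between target dissimilarities and embedded points."""
    delta = np.asarray(dissimilarities, dtype=float)
    x = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between the embedded points
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Count each pair once: upper triangle without the diagonal
    iu = np.triu_indices(len(x), k=1)
    numerator = ((dist[iu] - delta[iu]) ** 2).sum()
    denominator = (dist[iu] ** 2).sum()
    return float(np.sqrt(numerator / denominator))


if __name__ == "__main__":
    target = [[0.0, 10.0], [10.0, 0.0]]
    embedded = [[0.0, 0.0], [3.0, 4.0]]
    print(kruskal_stress1(target, embedded))
```

With the code from the question, this would be called as `kruskal_stress1(data.values, mds.embedding_)` once `data` holds the dissimilarities.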
