Standard-scaling data before Spectral Biclustering in scikit-learn?

Posted 2024-04-24 07:17:09


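I have a dataset that comes from different cohorts, and I want to bicluster it with the sklearn function Spectral Biclustering. As you can see in the link above, this method applies a normalization step before computing the SVD.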

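Is it necessary to standardize the data before biclustering, for example with StandardScaler (zero mean and a standard deviation of 1), given that the function above already applies a normalization of its own? Is that built-in normalization enough, or do I have to normalize the data beforehand, for example when the data come from different distributions?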
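To make the difference between the two preprocessing steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not scikit-learn's internal code: the helper name `bistochastic_like` and the toy matrix are made up, and the helper uses a simplified alternating row/column rescaling in the spirit of the 'bistochastic' method (rows and columns are rescaled until they all share the same mean).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy positive matrix: 6 samples x 4 features on very different scales
# (hypothetical data, only for illustration).
X = rng.uniform(1.0, 10.0, size=(6, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# Column-wise standardization, the same formula StandardScaler applies:
# subtract each column's mean, divide by its population standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)


def bistochastic_like(A, max_iter=1000, tol=1e-10):
    """Alternately rescale rows and columns of a positive matrix until
    every row mean and every column mean is (approximately) 1.

    Simplified stand-in for bistochastic normalization (an assumption of
    this sketch), not the routine used inside scikit-learn.
    """
    A = np.asarray(A, dtype=float).copy()
    for _ in range(max_iter):
        A /= A.mean(axis=1, keepdims=True)  # row step
        A /= A.mean(axis=0, keepdims=True)  # column step
        if np.allclose(A.mean(axis=1), 1.0, rtol=0.0, atol=tol):
            break
    return A


X_bi = bistochastic_like(X)

print("standardized: column means", X_std.mean(axis=0).round(6))
print("standardized: min value   ", round(float(X_std.min()), 3))
print("rescaled:     row means   ", X_bi.mean(axis=1).round(6))
print("rescaled:     column means", X_bi.mean(axis=0).round(6))
```

Standardization acts on each column separately and produces negative values, whereas the alternating rescaling acts on rows and columns together and keeps the matrix positive. The two steps are therefore not interchangeable, which is consistent with getting different biclusters with and without standard scaling. The sketch does not settle which preprocessing suits a particular dataset.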

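I get different results with and without standard scaling, and I could not find any information in the original paper on whether it is necessary or not.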

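The code and an example of my dataset are available. I do not know the ground truth for these data. At the end I compute the consensus score to compare the two biclusterings. Unfortunately, the clusters are not identical.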
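As a reading aid for the consensus score mentioned above, the following small sketch shows how `sklearn.metrics.consensus_score` behaves on two hand-made sets of biclusters. The indicator arrays are invented for illustration and have nothing to do with the dataset in the question.

```python
import numpy as np
from sklearn.metrics import consensus_score

# Two hand-made biclusters on a 4 x 2 matrix, given as boolean indicator
# arrays of shape (n_biclusters, n_rows) and (n_biclusters, n_columns).
rows = np.array([[True, True, False, False],
                 [False, False, True, True]])
cols = np.array([[True, False],
                 [False, True]])

# Same biclusters listed in the opposite order: the score matches
# biclusters pairwise, so the listing order does not matter.
score_same = consensus_score((rows, cols), (rows[::-1], cols[::-1]))

# Same row groups paired with the other column group: no bicluster of
# one set shares a single cell with any bicluster of the other set.
score_disjoint = consensus_score((rows, cols), (rows, cols[::-1]))

print(score_same, score_disjoint)
```

The first comparison scores 1 (identical biclusterings), the second scores 0 (no overlap at all). A value between the two, as in the question, means the biclusterings overlap only partly; the score measures agreement between the two results, not which of them is closer to the truth.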

I also tried artificial data (see the example behind the last link); there the two results are identical, unlike with the real data.
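The comparison on artificial data can be sketched roughly as follows. This is a minimal, hedged reconstruction rather than the code behind the link: it assumes `make_checkerboard` from `sklearn.datasets` as the data generator, and the shape, noise level and number of clusters are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler

n_clusters = (4, 3)  # (row clusters, column clusters); arbitrary choice

# Synthetic checkerboard data with known biclusters (all settings here
# are illustrative assumptions, not the poster's data).
X, true_rows, true_cols = make_checkerboard(
    shape=(120, 90), n_clusters=n_clusters, noise=5, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
models = {}
for name, data in [("none_scaled", X), ("scaled", X_scaled)]:
    model = SpectralBiclustering(
        n_clusters=n_clusters, method="bistochastic", random_state=0
    )
    model.fit(data)
    models[name] = model
    # Agreement with the known structure (1.0 would be perfect recovery).
    scores[name] = consensus_score(model.biclusters_, (true_rows, true_cols))

# Agreement between the two fits, as computed in the question.
between = consensus_score(
    models["none_scaled"].biclusters_, models["scaled"].biclusters_
)

for name, value in scores.items():
    print(f"{name:>12} vs. ground truth: {value:.3f}")
print(f"none_scaled vs. scaled: {between:.3f}")
```

On synthetic data the ground truth is known, so each variant can be scored against it directly. With real data that lack a ground truth, only the agreement between the two variants is available.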

So how do I know which approach is correct?

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.cluster import SpectralBiclustering
from sklearn.metrics import consensus_score
from sklearn.preprocessing import StandardScaler

n_clusters = (4, 4)

data_org = pd.read_csv('raw_data_biclustering.csv', sep=',', index_col=0) 


# scale data & transform to dataframe
data_scaled = StandardScaler().fit_transform(data_org)
data_scaled = pd.DataFrame(data_scaled, columns=data_org.columns, index=data_org.index)


# plot the scaled data before biclustering
plt.imshow(data_scaled, aspect='auto', vmin=-3, vmax=5)
plt.title("Original dataset")
plt.show()


data_type = ['none_scaled', 'scaled']
data_all = [data_org, data_scaled]

models_all = []

for name, data in zip(data_type,data_all):

    # spectral biclustering on the current variant of the data
    # (n_jobs is omitted: recent scikit-learn releases no longer accept it here)
    model = SpectralBiclustering(n_clusters=n_clusters, method='bistochastic',
                                 svd_method='randomized', random_state=0)
    model.fit(data)


    newOrder_row = [list(r) for r in zip(model.row_labels_, data.index)]
    newOrder_row.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_row = [i[1] for i in newOrder_row]

    newOrder_col = [list(c) for c in zip(model.column_labels_, [int(x) for x in data.keys()])]
    newOrder_col.sort(key=lambda k: (k[0], k[1]), reverse=False)
    order_col = [i[1] for i in newOrder_col]

    # reorder the data matrix
    X_plot = data_scaled.copy()
    X_plot = X_plot.reindex(order_row) # rows
    X_plot = X_plot[[str(x) for x in order_col]] # columns

    # use clustermap without clustering
    cm=sns.clustermap(X_plot, method=None, metric=None, cmap='viridis'
                  ,row_cluster=False, row_colors=None
                  , col_cluster=False, col_colors=None
                  , yticklabels=1, xticklabels=1
                  , standard_scale=None, z_score=None, robust=False
                  , vmin=-3, vmax=5
                  ) 

    ax = cm.ax_heatmap

    # set labelsize smaller
    cm_ax = plt.gcf().axes[-2]
    cm_ax.tick_params(labelsize=5.5)


    # plot lines for the different clusters
    hor_lines = [sum(item) for item in model.biclusters_[0]]
    hor_lines = list(np.cumsum(hor_lines[::n_clusters[1]]))

    ver_lines = [sum(item) for item in model.biclusters_[1]]
    # the first n_clusters[1] entries cover each column cluster exactly once
    ver_lines = list(np.cumsum(ver_lines[:n_clusters[1]]))

    for pp in range(len(hor_lines)-1):
        cm.ax_heatmap.hlines(hor_lines[pp],0,X_plot.shape[1], colors='r')

    for pp in range(len(ver_lines)-1):
        cm.ax_heatmap.vlines(ver_lines[pp],0,X_plot.shape[0], colors='r')

    # title
    title = name+' - '+str(n_clusters[1])+'-'+str(n_clusters[0])
    plt.title(title)
    cm.savefig(title,dpi=300)
    plt.show() 

    # save models
    models_all.append(model)

# compare the two fitted models (unscaled vs. scaled)
score = consensus_score(models_all[0].biclusters_, models_all[1].biclusters_)
print("consensus score between unscaled and scaled: {:.3f}".format(score))
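The reordering and the separator lines in the plotting step above can also be derived directly from the label vectors. The sketch below is one possible alternative, not the poster's method: the helper `order_and_boundaries` is a made-up name, and the label arrays are hypothetical stand-ins for `model.row_labels_` and `model.column_labels_`.

```python
import numpy as np

# Hypothetical label vectors, standing in for model.row_labels_ and
# model.column_labels_ of a fitted model.
row_labels = np.array([2, 0, 1, 0, 2, 2])
col_labels = np.array([1, 0, 1, 0])


def order_and_boundaries(labels):
    """Return the permutation that groups equal labels together and the
    positions where one cluster ends and the next begins."""
    order = np.argsort(labels, kind="stable")
    sizes = np.bincount(labels)
    boundaries = np.cumsum(sizes)[:-1]  # drop the final edge of the plot
    return order, boundaries


row_order, row_bounds = order_and_boundaries(row_labels)
col_order, col_bounds = order_and_boundaries(col_labels)

print(row_order, row_bounds)
print(col_order, col_bounds)
```

With a DataFrame, `data.iloc[row_order, col_order]` would give the reordered matrix. Unlike the sort in the code above, the stable argsort keeps rows within a cluster in their original position order instead of sorting them by index value, and an empty cluster would produce a repeated boundary value.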
