Python中的主成分分析(PCA)

2024-04-25 19:11:18 发布

您现在位置:Python中文网/ 问答频道 /正文


Tags: python
3条回答

另一个使用numpython的PCA。和“道格”的想法一样,但那一个没跑。

from numpy import array, dot, mean, std, empty, argsort
from numpy.linalg import eigh, solve
from numpy.random import randn
from matplotlib.pyplot import subplots, show

def cov(X):
    """
    Covariance matrix
    note: specifically for mean-centered data
    note: numpy's `cov` uses N-1 as normalization
    """
    return dot(X.T, X) / X.shape[0]
    # N = data.shape[1]
    # C = empty((N, N))
    # for j in range(N):
    #   C[j, j] = mean(data[:, j] * data[:, j])
    #   for k in range(j + 1, N):
    #       C[j, k] = C[k, j] = mean(data[:, j] * data[:, k])
    # return C

def pca(data, pc_count = None):
    """
    Principal component analysis using eigenvalues
    note: this mean-centers and auto-scales the data (in-place)
    """
    data -= mean(data, 0)
    data /= std(data, 0)
    C = cov(data)
    E, V = eigh(C)
    key = argsort(E)[::-1][:pc_count]
    E, V = E[key], V[:, key]
    U = dot(data, V)  # used to be dot(V.T, data.T).T
    return U, E, V

""" test data """
data = array([randn(8) for k in range(150)])
data[:50, 2:4] += 5
data[50:, 2:5] += 5

""" visualize """
trans = pca(data, 3)[0]
fig, (ax1, ax2) = subplots(1, 2)
ax1.scatter(data[:50, 0], data[:50, 1], c = 'r')
ax1.scatter(data[50:, 0], data[50:, 1], c = 'b')
ax2.scatter(trans[:50, 0], trans[:50, 1], c = 'r')
ax2.scatter(trans[50:, 0], trans[50:, 1], c = 'b')
show()

这和短得多的

from sklearn.decomposition import PCA

def pca2(data, pc_count = None):
    return PCA(n_components = 4).fit_transform(data)

据我所知,使用特征值(第一种方法)对于高维数据和更少的样本更好,而使用奇异值分解则更好,如果您有更多的样本而不是维度。

尽管已经接受了另一个答案,但我还是发布了我的答案;接受的答案依赖于deprecated function;此外,这个被弃用的函数基于奇异值分解(SVD),它(尽管完全有效)是计算PCA的两种通用技术中内存和处理器更密集的。这在这里特别相关,因为OP中的数据数组的大小。使用基于协方差的PCA,计算流中使用的数组只是144x 144,而不是26424x144(原始数据数组的维数)。

这里有一个使用SciPy中的linalg模块的PCA的简单工作实现。由于此实现首先计算协方差矩阵,然后在此阵列上执行所有后续计算,因此它使用的内存远远少于基于SVD的PCA。

(除了import语句之外,NumPy中的linalg模块也可以在下面的代码没有任何更改的情况下使用,该语句来自NumPy import linalg as LA

本PCA实施的两个关键步骤是:

  • 计算协方差矩阵

  • 取这个矩阵的特征值

在下面的函数中,参数dims_rescaled_data是指rescaled数据矩阵中所需的维数;该参数的默认值只有二维,但下面的代码不限于二维,但可以是小于原始数据数组列数的任意值。


def PCA(data, dims_rescaled_data=2):
    """
    returns: data transformed in 2 dims/columns + regenerated original data
    pass in: data as 2D NumPy array
    """
    import numpy as NP
    from scipy import linalg as LA
    m, n = data.shape
    # mean center the data
    data -= data.mean(axis=0)
    # calculate the covariance matrix
    R = NP.cov(data, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix
    # use 'eigh' rather than 'eig' since R is symmetric, 
    # the performance gain is substantial
    evals, evecs = LA.eigh(R)
    # sort eigenvalue in decreasing order
    idx = NP.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    # sort eigenvectors according to same index
    evals = evals[idx]
    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    evecs = evecs[:, :dims_rescaled_data]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return NP.dot(evecs.T, data.T).T, evals, evecs

def test_PCA(data, dims_rescaled_data=2):
    '''
    test by attempting to recover original data array from
    the eigenvectors of its covariance matrix & comparing that
    'recovered' array with the original data
    '''
    _ , _ , eigenvectors = PCA(data, dim_rescaled_data=2)
    data_recovered = NP.dot(eigenvectors, m).T
    data_recovered += data_recovered.mean(axis=0)
    assert NP.allclose(data, data_recovered)


def plot_pca(data):
    from matplotlib import pyplot as MPL
    clr1 =  '#2026B2'
    fig = MPL.figure()
    ax1 = fig.add_subplot(111)
    data_resc, data_orig = PCA(data)
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
    MPL.show()

>>> # iris, probably the most widely used reference data set in ML
>>> df = "~/iris.csv"
>>> data = NP.loadtxt(df, delimiter=',')
>>> # remove class labels
>>> data = data[:,:-1]
>>> plot_pca(data)

下图是虹膜数据上此PCA函数的可视化表示。如您所见,2D转换清晰地将类I与类II和类III分离(但类II与类III不分离,类III实际上需要另一个维度)。

enter image description here

您可以在matplotlib模块中找到PCA函数:

import numpy as np
from matplotlib.mlab import PCA

data = np.array(np.random.randint(10,size=(10,3)))
results = PCA(data)

结果将存储主成分分析的各种参数。 它来自matplotlib的mlab部分,这是与MATLAB语法兼容的层

编辑: 在博客nextgenetics上,我发现了一个很好的演示,演示了如何使用matplotlib mlab模块执行和显示PCA,玩得开心并查看该博客!

相关问题 更多 >

    热门问题