explained_variance_ratio_ of RandomizedPCA sums to more than 1 in sklearn 0.15.0

Asked 2025-04-18 14:11

When I run this code with sklearn.__version__ equal to 0.15.0, I get a strange result:

import numpy as np
from scipy import sparse
from sklearn.decomposition import RandomizedPCA

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

s = sparse.csr_matrix(a)

pca = RandomizedPCA(n_components=20)
pca.fit_transform(s)

With version 0.15.0 I get:

>>> pca.explained_variance_ratio_.sum()
2.1214285714285697

With version 0.14.1 I get:

>>> pca.explained_variance_ratio_.sum()
0.99999999999999978

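This sum should never be greater than 1. According to the documentation of explained_variance_ratio_:

Percentage of variance explained by each of the selected components. If k is not set then all components are stored and the sum of explained variances is equal to 1.0.

What is going on here?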


1 Answer

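Version 0.14.1 had a bug: explained_variance_ratio_.sum() always returned 1.0, no matter how many components were extracted (that is, regardless of the truncation). In 0.15.0 this was fixed for dense arrays, as the following example shows: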

>>> RandomizedPCA(n_components=3).fit(a).explained_variance_ratio_.sum()
0.86786547849848206
>>> RandomizedPCA(n_components=4).fit(a).explained_variance_ratio_.sum()
0.95868429631268515
>>> RandomizedPCA(n_components=5).fit(a).explained_variance_ratio_.sum()
1.0000000000000002

Your data has rank 5 (5 components explain 100% of the variance).
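The rank claim can be checked without scikit-learn. The following is a minimal NumPy sketch, assuming exact (not randomized) PCA: center the columns, take the singular values, and divide each component's variance by the total variance. The variable names are illustrative and not taken from the library.

```python
import numpy as np

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]],
             dtype=float)

# PCA centers every column before decomposing the data.
centered = a - a.mean(axis=0)

# Squared singular values of the centered data, divided by the number of
# samples, are the variances along the principal components.
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance = singular_values ** 2 / a.shape[0]

# Total variance: sum of the column variances (ddof=0, matching the line above).
total_variance = a.var(axis=0).sum()
ratio = explained_variance / total_variance

cumulative = np.cumsum(ratio)
rank = np.linalg.matrix_rank(centered)

print("rank of centered data:", rank)
print("cumulative ratio for 1..6 components:", cumulative[:6])
```

Because numerator and denominator are both computed from the centered data, the cumulative ratio can only grow towards 1 and reaches it once the number of components equals the rank.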

If you use RandomizedPCA on a sparse matrix, you get:

DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.

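Using RandomizedPCA on sparse data is incorrect, because the data cannot be centered without destroying its sparsity, and the resulting dense matrix can use far too much memory on realistically sized sparse data. PCA, however, requires centering.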
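Both effects can be illustrated with plain NumPy on the array from the question. This is a sketch of the arithmetic, not the library's code: it assumes the sparse path effectively took the component "variances" from the uncentered data while dividing by the true (centered) total variance. Under that assumption the sum matches the value reported in the question.

```python
import numpy as np

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]],
             dtype=float)
n_samples = a.shape[0]

# 1) Centering destroys sparsity: every column with a nonzero mean
#    becomes fully populated after the mean is subtracted.
centered = a - a.mean(axis=0)
nonzero_before = np.count_nonzero(a)
nonzero_after = np.count_nonzero(centered)
print("nonzero entries before/after centering:", nonzero_before, nonzero_after)

# 2) Skipping the centering step inflates the ratio.
#    Numerator: squared singular values of the UNcentered data.
uncentered_energy = (np.linalg.svd(a, compute_uv=False) ** 2).sum() / n_samples
#    Denominator: true total variance of the columns.
total_variance = a.var(axis=0).sum()
mismatched_sum = uncentered_energy / total_variance
print("sum of mismatched ratios:", mismatched_sum)  # 297/140, about 2.1214
```

The uncentered numerator includes the energy of the column means, which the variance in the denominator excludes, so the quotient is no longer bounded by 1.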

TruncatedSVD gives correct explained variance ratios on sparse data (but keep in mind that it does not do exactly the same thing as PCA on dense data):

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=3).fit(s).explained_variance_ratio_.sum()
0.67711305361490826
>>> TruncatedSVD(n_components=4).fit(s).explained_variance_ratio_.sum()
0.8771350212934137
>>> TruncatedSVD(n_components=5).fit(s).explained_variance_ratio_.sum()
0.95954459082530097
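Why these sums stay below 1 even without centering can be sketched in NumPy. The assumption here, based on my understanding of how TruncatedSVD reports its ratios, is that each ratio is the variance of the data projected onto a component divided by the total column variance. The helper `truncated_ratio_sum` is hypothetical, and an exact SVD stands in for the randomized solver, so the digits need not match the transcript above.

```python
import numpy as np

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]],
             dtype=float)

# SVD of the raw, UNcentered data: the input itself is never centered,
# which is what keeps this approach compatible with sparse matrices.
_, _, vt = np.linalg.svd(a, full_matrices=False)
total_variance = a.var(axis=0).sum()


def truncated_ratio_sum(k):
    """Hypothetical helper: explained variance ratio sum for the top k components."""
    projected = a @ vt[:k].T                      # shape (n_samples, k)
    # np.var subtracts the mean of each projected column, so the numerator
    # is a genuine variance even though `a` was not centered.
    return projected.var(axis=0).sum() / total_variance


sums = [truncated_ratio_sum(k) for k in (3, 4, 5)]
print(sums)
```

Since the component directions are orthonormal, the variance captured along them can never exceed the total variance, which is why the sum is bounded by 1 here but was not in the mismatched RandomizedPCA case.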

Answered 2025-04-18 by Python大师
