pandas: getting combinations of highly correlated columns

Asked 2025-04-28 12:45

I have a dataset with 6 columns. I used pandas to compute the correlation matrix between them and got the following result:

               age  earnings    height     hours  siblings    weight
age       1.000000  0.026032  0.040002  0.024118  0.155894  0.048655
earnings  0.026032  1.000000  0.276373  0.224283  0.126651  0.092299
height    0.040002  0.276373  1.000000  0.235616  0.077551  0.572538
hours     0.024118  0.224283  0.235616  1.000000  0.067797  0.143160
siblings  0.155894  0.126651  0.077551  0.067797  1.000000  0.018367
weight    0.048655  0.092299  0.572538  0.143160  0.018367  1.000000

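I want to find the pairs of columns whose correlation is greater than 0.5, excluding a column paired with itself. In other words, I would like output along these lines: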

[('height', 'weight')]

I tried doing this with nested for loops, but I don't think it is the most appropriate or efficient approach:

correlated = []
for column1 in columns:
    for column2 in columns:
        if column1 != column2:
            correlation = df[column1].corr(df[column2])
            if correlation > 0.5 and (column2, column1) not in correlated:
                correlated.append((column1, column2))

Here, df is my original DataFrame and columns is the list of its column names. This approach does produce the result I want:

[(u'height', u'weight')]

1 Answer

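Here is an example using numpy. Note that it assumes df already holds the correlation matrix (for example, the result of calling .corr() on your original data), not the raw data: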

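import numpy as np

# Positions (row, column) of every entry above the threshold
rows, cols = np.where(df > 0.5)
# Keep only x < y: this drops the diagonal and the mirrored duplicates
indices = [(df.index[x], df.columns[y]) for x, y in zip(rows, cols)
                                        if x < y]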

After running this, indices will contain:

[('height', 'weight')]
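
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. It rebuilds the correlation matrix from the question by hand and uses a strict upper-triangle mask from np.triu instead of the x < y comparison. The helper name correlated_pairs and its threshold parameter are made up for illustration; they are not part of pandas or numpy.

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr, threshold=0.5):
    """Return (row, column) label pairs whose correlation exceeds threshold.

    Assumes `corr` is a square, symmetric correlation matrix, such as the
    result of DataFrame.corr(). The function name is hypothetical.
    """
    # Strict upper triangle (k=1): excludes the diagonal and keeps each
    # unordered pair exactly once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    rows, cols = np.where((corr.to_numpy() > threshold) & upper)
    return [(corr.index[r], corr.columns[c]) for r, c in zip(rows, cols)]


# The correlation matrix from the question, typed in by hand.
labels = ["age", "earnings", "height", "hours", "siblings", "weight"]
values = [
    [1.000000, 0.026032, 0.040002, 0.024118, 0.155894, 0.048655],
    [0.026032, 1.000000, 0.276373, 0.224283, 0.126651, 0.092299],
    [0.040002, 0.276373, 1.000000, 0.235616, 0.077551, 0.572538],
    [0.024118, 0.224283, 0.235616, 1.000000, 0.067797, 0.143160],
    [0.155894, 0.126651, 0.077551, 0.067797, 1.000000, 0.018367],
    [0.048655, 0.092299, 0.572538, 0.143160, 0.018367, 1.000000],
]
corr = pd.DataFrame(values, index=labels, columns=labels)

pairs = correlated_pairs(corr)
print(pairs)  # [('height', 'weight')]
```

If strong negative correlations should count as well, one option is to pass corr.abs() instead of corr, so the threshold is applied to the magnitude.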

Answered 2025-04-28 by Python大师
