Computing a special correlation distance matrix faster

2 votes
1 answer
1952 views
Asked 2025-04-18 15:37

I want to build a distance matrix using the Pearson correlation distance. I first tried scipy.spatial.distance.pdist(df, 'correlation'), which is very fast on my dataset of 5000 rows and 20 features.

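Because I want to build a recommender system, I would like to adjust the distance calculation so that it only considers features that are not NaN for both users. As it stands, scipy.spatial.distance.pdist(df, 'correlation') outputs NaN whenever it encounters a feature whose value is float('nan').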
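A minimal illustration of that behaviour, using a small made-up array (the values are hypothetical, not taken from my dataset):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three users, four features; the last user has one missing rating.
x = np.array([
    [1.0, 2.0, 3.0, 5.0],
    [2.0, 1.0, 4.0, 3.0],
    [1.0, np.nan, 2.0, 4.0],
])

# squareform() expands the condensed pdist output into a full matrix.
d = squareform(pdist(x, "correlation"))
print(d)  # the off-diagonal entries involving the last row are NaN
```

A single missing feature is enough to lose every distance that involves that user.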

Below is my code; df is my 5000 x 20 pandas DataFrame.

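import math
import scipy.spatial.distance

dist_mat = []
d = df.shape[1]
for i, row_i in enumerate(df.itertuples()):
    for j, row_j in enumerate(df.itertuples()):
        if i < j:
            print(i, j)
            # itertuples() puts the index at position 0, so feature t sits at position t + 1.
            # Keep only the positions where both rows have a value.
            keep = [t + 1 for t in range(d) if not (math.isnan(row_i[t + 1]) or math.isnan(row_j[t + 1]))]
            dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in keep], [row_j[t] for t in keep]))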

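This code runs, but compared with scipy.spatial.distance.pdist(df, 'correlation') it is dramatically slower. My question is: how can I improve my code so that it runs faster? Alternatively, where can I find a library that computes the correlation between two vectors while only considering the features present in both?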

Thank you for your answers.

1 Answer

2

I think you need to implement this in Cython. Here is an example:

#cython: boundscheck=False, wraparound=False, cdivision=True

import numpy as np

from libc.math cimport sqrt

def pair_correlation(double[:, ::1] x):
    # Correlation distance for every pair of rows, skipping features that are NaN in either row.
    cdef double[:, ::1] res = np.empty((x.shape[0], x.shape[0]))
    cdef double u, v, um, vm
    cdef int i, j, k, count
    cdef double du, dv, n, r
    cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv

    for i in range(x.shape[0]):
        for j in range(i, x.shape[0]):
            sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
            count = 0            
            for k in range(x.shape[1]):
                u = x[i, k]
                v = x[j, k]
                # NaN is the only value not equal to itself, so this keeps features present in both rows.
                if u == u and v == v:
                    sum_u += u
                    sum_v += v
                    sum_u2 += u*u
                    sum_v2 += v*v
                    sum_uv += u*v
                    count += 1
            # No feature is shared by the two rows: store a sentinel value instead of a distance.
            if count == 0:
                res[i, j] = res[j, i] = -9999
                continue

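            # Means over the shared features.
            um = sum_u / count
            vm = sum_v / count
            # Centered cross-product and norms, expanded in terms of the running sums.
            n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
            du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count)
            dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
            # Correlation distance is 1 - Pearson r (NaN if a row is constant over the shared features).
            r = 1 - n / (du * dv)
            res[i, j] = res[j, i] = r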
    return res.base
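If compiling Cython is inconvenient, the same running sums can probably be obtained for all pairs at once with matrix products in plain NumPy. The following is only a sketch under that assumption: `pair_correlation_np` is a name made up here, it reuses the -9999 sentinel from the loop above, and it has not been benchmarked against the Cython version.

```python
import numpy as np

def pair_correlation_np(x):
    """Correlation distance between all row pairs, using only the
    features that are non-NaN in both rows (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    present = ~np.isnan(x)              # True where a value exists
    m = present.astype(float)
    x0 = np.where(present, x, 0.0)      # NaN -> 0 so it drops out of the sums

    count = m @ m.T                     # shared features per pair
    sum_u = x0 @ m.T                    # sum of row i over features shared with row j
    sum_v = sum_u.T                     # sum of row j over the same features
    sum_u2 = (x0 * x0) @ m.T
    sum_v2 = sum_u2.T
    sum_uv = x0 @ x0.T

    # Divisions by zero (no shared feature, or zero variance) yield NaN/inf here.
    with np.errstate(divide="ignore", invalid="ignore"):
        um = sum_u / count
        vm = sum_v / count
        n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
        du2 = sum_u2 - 2 * sum_u * um + um * um * count
        dv2 = sum_v2 - 2 * sum_v * vm + vm * vm * count
        dist = 1 - n / np.sqrt(du2 * dv2)

    # Same sentinel as the Cython loop for pairs with no shared feature.
    dist[count == 0] = -9999
    return dist

# Small made-up example with about 30% missing values.
rng = np.random.default_rng(0)
x_demo = rng.random((30, 20))
x_demo[x_demo < 0.3] = np.nan
r_demo = pair_correlation_np(x_demo)
```

Note that this allocates several n x n temporaries, so for 5000 rows it needs considerably more memory than the loop, which only stores the result.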

To check the output on data without NaN:

import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))

To check the output on data with NaN:

x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)

i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12
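As for an existing library: pandas' `DataFrame.corr()` computes pairwise correlations between columns while excluding missing values pair by pair, so transposing the frame should give the same kind of distance between rows. This is a sketch rather than a tested replacement for the Cython code; `nan_aware_correlation_distance` and the demo data are made up for illustration.

```python
import numpy as np
import pandas as pd

def nan_aware_correlation_distance(df, min_periods=2):
    """1 - Pearson r between rows, ignoring features missing in either row.

    DataFrame.corr() works column-wise, hence the transpose. Pairs with
    fewer than `min_periods` shared features come out as NaN.
    """
    return 1 - df.T.corr(method="pearson", min_periods=min_periods)

# Made-up data: 50 users, 20 features, about 30% missing.
rng = np.random.default_rng(0)
values = rng.random((50, 20))
values[values < 0.3] = np.nan
demo_df = pd.DataFrame(values)

dist_df = nan_aware_correlation_distance(demo_df)
```

The result is a square DataFrame indexed by the original row labels; `dist_df.to_numpy()` gives a plain array if that is more convenient.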
