Python 多重简单线性回归

Question

请注意，这不是关于多重回归的问题，而是关于在Python/NumPy（2.7）中多次进行简单（单变量）回归的问题。

我有两个 m 行 n 列的数组 x 和 y。这些行是相互对应的，每一对就是一组测量的 (x,y) 点。也就是说，plt.plot(x.T, y.T, '.') 会绘制出 m 个数据集/测量结果。

我想知道进行 m 次线性回归的最佳方法是什么。目前我是在循环每一行，并使用 scipy.stats.linregress()。假设我不想用矩阵的线性代数方法，而是想用这个函数，或者类似的黑箱函数。我可以尝试 np.vectorize，但文档显示它也是在循环。

经过一些实验，我还发现可以使用列表推导和 map() 来得到正确的结果。我把这两种解决方案放在下面。在IPython中，使用一个小数据集（已注释掉）时，`%%timeit` 的返回结果是：

(loop) 1000 loops, best of 3: 642 µs per loop
(map) 1000 loops, best of 3: 634 µs per loop

为了放大这个效果，我制作了一个更大的随机数据集（维度为 trials x trials）：

(loop, trials = 1000)  1 loops, best of 3: 299 ms per loop
(loop, trials = 10000) 1 loops, best of 3: 5.64 s per loop
(map, trials = 1000)   1 loops, best of 3: 256 ms per loop
(map, trials = 10000)  1 loops, best of 3: 2.37 s per loop

在一个非常大的数据集上，这样的速度提升还不错，但我原本期待能更快一些。有没有更好的方法呢？

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(42)
#y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
#x = np.tile(np.arange(4), (3,1))
trials = 1000
y = np.random.rand(trials,trials)
x = np.tile(np.arange(trials), (trials,1))
num_rows = shape(y)[0]
slope = np.zeros(num_rows)
inter = np.zeros(num_rows)
for k, xrow in enumerate(x):
    yrow = y[k,:]
    slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope + intercept)
# Can the loop be removed?
tempx = [x[k,:] for k in range(num_rows)]
tempy = [y[k,:] for k in range(num_rows)]
results = np.array(map(stats.linregress, tempx, tempy))
slope_vec = results[:,0]
inter_vec = results[:,1]
#plt.plot(x.T, y.T, '.')
#plt.hold = True
#plt.plot(x.T, x.T*slope_vec + inter_vec)
print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)

性能优化 numpy 列表推导线性代数数据集多重回归简单线性回归测量结果

Python 多重简单线性回归

1 个回答

撰写回答