How do I perform linear/non-linear regression between two 2-D numpy arrays and visualize it with matplotlib?

Question

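First, some context: I need to run a regression analysis on data for a large country, relating a disease to several environmental factors, so I have a lot of data.

Currently I store the data in tiff files and read them into numpy arrays with gdal. Each dataset is read as a numpy array of shape <54L,53L>, and I have one such array per dataset. I now need to run a regression between two of these 2-D numpy arrays. The values in the arrays are Float64. For example: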

[[ 162.32145691  158.19345093  153.15704346 ...,  123.77481079 123.63883972  123.6770401 ]
 [ 164.55152893  160.59266663  155.75968933 ...,  121.28504181  121.1164093  121.16275024] ..., 
 [ 321.38272095  329.53326416  338.85699463 ...,  193.69404602   192.50938416  191.42672729]]

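For example, I want the relationship between the disease dataset and environmental factor 1, between the disease dataset and environmental factor 2, and so on. Because these relationships are complex and not well understood, I wanted to plot the two 2-D arrays first, but I could not find a suitable way to do it.

So, how can I draw a scatter plot of these two 2-D arrays in matplotlib? I say scatter plot because it would make it easier for me to infer the relationship and then choose a suitable regression model (linear, non-linear, logarithmic, etc.). I used the following code to plot the two numpy arrays against each other row by row: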

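import matplotlib.pyplot as plt

# One figure per row, because plt.show() is called inside the loop.
for i in range(JanTemp.shape[0]):  # 54 rows
    plt.scatter(JanTemp[i], can02[i])
    plt.title('Disease vs Temperature')
    plt.ylabel('DiseaseCases')
    plt.xlabel('Temp')
    plt.show()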

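Here can02 is the response variable and JanTemp is the predictor. As I expected, I got 54 consecutive plots, and frustratingly both variables are drawn in the same color (this is my first time using matplotlib, and I don't know how to give each variable its own color). Is there a better way? If so, please suggest one. I think a 3-D visualization might work, but how would I draw inferences from it? So please suggest a better way to visualize this in 2-D.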
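One possible approach, offered as a sketch rather than a definitive answer: because the two grids have the same shape, each cell gives one (predictor, response) pair, so both arrays can be flattened and drawn on a single scatter plot. In the example below, `scatter_grids` is a hypothetical helper, the `*_demo` arrays are synthetic stand-ins for `JanTemp` and `can02`, and coloring the points by row index is only one way to keep some spatial information visible. The Agg backend is selected just so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def scatter_grids(predictor, response, xlabel="Temp", ylabel="DiseaseCases"):
    """Draw every cell of two same-shaped 2-D grids on one scatter plot."""
    predictor = np.asarray(predictor, dtype=float)
    response = np.asarray(response, dtype=float)
    if predictor.shape != response.shape:
        raise ValueError("predictor and response must have the same shape")
    n_rows, n_cols = predictor.shape
    x = predictor.ravel()  # row-major flattening: cell k comes from row k // n_cols
    y = response.ravel()
    row_index = np.repeat(np.arange(n_rows), n_cols)  # row of each flattened cell
    keep = np.isfinite(x) & np.isfinite(y)  # skip NaN/inf cells, if any
    fig, ax = plt.subplots()
    points = ax.scatter(x[keep], y[keep], c=row_index[keep],
                        cmap="viridis", s=8, alpha=0.6)
    fig.colorbar(points, ax=ax, label="row index")
    ax.set_title("Disease vs Temperature")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fig, ax, int(keep.sum())


# Synthetic stand-ins for JanTemp and can02 (assumed shape: 54 rows x 53 columns).
rng = np.random.default_rng(0)
jan_temp_demo = rng.uniform(100.0, 350.0, size=(54, 53))
can02_demo = 0.5 * jan_temp_demo + rng.normal(0.0, 10.0, size=(54, 53))

fig, ax, n_points = scatter_grids(jan_temp_demo, can02_demo)
```

With an interactive backend, calling `plt.show()` afterwards displays the single figure; `fig.savefig(...)` writes it to a file instead. Note that in a scatter plot each point combines one predictor value with one response value, so a point has one color by construction; the two variables are told apart by the axes, not by color.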

Since the plots did not tell me much, I decided to start with linear regression. I use scipy.stats.linregress, iterating over each row as above:

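from scipy import stats

months = [JanTemp, FebTemp, MarTemp1, AprTemp, MayTemp, JunTemp,
          JulTemp, AugTemp, SepTemp, OctTemp, NovTemp, DecTemp]
n_rows = can02.shape[0]  # 54 rows per grid
for month in months:
    csum = 0.0   # running sum of the row-wise correlation coefficients
    pcsum = 0.0  # running sum of the row-wise coefficients of determination, in percent
    for i in range(n_rows):
        slope, intercept, r_value, p_value, std_err = stats.linregress(month[i], can02[i])
        csum += r_value
        pcsum += (r_value ** 2) * 100
    # Average over the number of rows actually regressed.
    print("mean correlation coefficient is %s" % (csum / n_rows))
    print("The avg COD is %s" % (pcsum / n_rows))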

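Here JanTemp, FebTemp, etc. are the files, each with dimensions 54,53. For each file I run the regression row by row, once per row. This also feels rather tedious. Is there a better way, such as a function or a module?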
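One way to avoid the row loop, sketched under the assumption that every cell can be treated as one observation: flatten both grids and call `scipy.stats.linregress` once per month. `regress_grids` is a hypothetical helper and the `*_demo` arrays are synthetic stand-ins for the real grids, with only two months shown for brevity.

```python
import numpy as np
from scipy import stats


def regress_grids(predictor, response):
    """Run one simple linear regression over all cells of two same-shaped grids."""
    x = np.asarray(predictor, dtype=float).ravel()
    y = np.asarray(response, dtype=float).ravel()
    keep = np.isfinite(x) & np.isfinite(y)  # drop NaN/inf cells, if any
    return stats.linregress(x[keep], y[keep])


# Synthetic stand-ins for the monthly temperature grids (54 rows x 53 columns).
rng = np.random.default_rng(42)
months_demo = {
    "Jan": rng.uniform(100.0, 350.0, size=(54, 53)),
    "Feb": rng.uniform(100.0, 350.0, size=(54, 53)),
}
# Synthetic response that depends on the "Jan" grid only, plus noise.
can02_demo = 10.0 + 0.3 * months_demo["Jan"] + rng.normal(0.0, 5.0, size=(54, 53))

results = {}
for name, grid in months_demo.items():
    res = regress_grids(grid, can02_demo)
    results[name] = res
    print("%s: slope=%.3f r=%.3f COD=%.1f%%"
          % (name, res.slope, res.rvalue, 100 * res.rvalue ** 2))
```

A single pooled regression answers a slightly different question than averaging 54 row-wise correlation coefficients, so the two numbers need not agree. If the per-row values matter, the loop can stay; if the goal is one relationship per month, the pooled fit is probably the simpler choice.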

Another method I know of is ordinary least squares (OLS) from the statsmodels.api module, used as follows:

import statsmodels.api as sm

y = can02    # response, 2-D array
x = JanTemp  # predictor, 2-D array
X = sm.add_constant(x)  # add a constant (intercept) column to the regressors
est = sm.OLS(y, X)      # set up the OLS regression of the response on the predictor
est = est.fit()         # fit the model
est.summary()           # summary of the whole calculation
est.params              # regression coefficients

But I got the following long error message:

Traceback (most recent call last):
  File "H:\Python\results.py", line 77, in <module>
    est.summary() #Gives the summary of whole calculation
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 1230, in summary
    top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
  File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 95, in __get__
    _cachedval = self.fget(obj)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 959, in rsquared
    return 1 - self.ssr/self.centered_tss
  File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 95, in __get__
    _cachedval = self.fget(obj)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 931, in ssr
    return np.dot(wresid, wresid)
ValueError: matrices are not aligned

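I don't understand why the matrices are not aligned. In any case, back to my original question: is there another, similar way to run the regression, and how do I apply it to 2-D arrays? Thank you. I know this long question takes up a lot of your valuable time, but I wanted to be clear. I searched many questions on this site and others without finding a suitable or relevant solution. Thanks.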

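Tags: data visualization, numpy, scatter plot, regression analysis, linear regression, non-linear regression, ordinary least squares, environmental factors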
