How do I create a trendline in Python when the data has gaps from missing values?

Asked 2025-04-18 12:33

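I'm new to Python and data analysis, and I've been asked to produce a scatter plot. Many elements in my data set are None. When I use polyfit to create a trendline (a line of best fit), I get errors because of the None values. I've tried both lists and numpy arrays, with poor results. I've also tried masked_array, masked_invalid and various other configurations, but I always end up with an array full of None. Is there a way to create the trendline without removing the elements that contain None? I need to keep them so that my plot has the correct dimensions. I'm using Python 2.7. Here is my code so far: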

import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
import pylab
#The InterpolatedUnivariateSpline method popped up during my endeavor 
#to extrapolate the trendline through the gaps in data.
#To be honest, I don't think its doing anything for me...
from scipy.interpolate import InterpolatedUnivariateSpline  

fig, ax = plt.subplots(1,1)
ax.scatter(y, dbm, color = 'purple', marker = 'o', s = 100)
plt.xlim(min(y), max(y)) 
plt.xlabel('Temp - C')
dbm_array = np.asarray(dbm) #dbm and y are lists earlier in the program
y_array = np.asarray(y)

x = np.linspace(min(y), max(y), len(y))
order = 1
s = InterpolatedUnivariateSpline(y, dbm, k=order)
blah = s(x)
plt.plot(y, blah, '--k')

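For some reason this code gives me the scatter plot but no trendline. There are no errors, so I assume that part is fine... Thank you very much for your help!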

data visualization, numpy, data analysis, scatter plot, polyfit, trendline, missing data, masked_array

1 Answer

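First of all, if you have arrays, they should not contain None, only nan. None is an object and cannot be represented as a number, so that may be where the problem lies. Let's take a look: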

import numpy as np

a = np.array([None, 1, 2, 3, 4, None])

What do we get?

>>> a
array([None, 1, 2, 3, 4, None], dtype=object)

This is definitely not what we want. It is an object array, which is generally not useful. You cannot do any calculations with it:

>>> 2*a
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

This happens because the element-wise multiplication tries to compute 2*None.

So what you really want is:

>>> a = np.array([np.nan, 1, 2, 3, 4, np.nan])
>>> a
array([ nan,   1.,   2.,   3.,   4.,  nan])
>>> a.dtype
dtype('float64')
>>> 2 * a
array([ nan,   2.,   4.,   6.,   8.,  nan])

Now everything works as expected.
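
If the data starts out as a plain Python list containing None, as in the question, one way to get such a float array is to request a float dtype during conversion, in which case NumPy stores each None as nan. A minimal sketch, with made-up sample values standing in for the asker's dbm list:

```python
import numpy as np

# Hypothetical sample data: a plain list with None marking missing readings
dbm = [None, 1, 2, 3, 4, None]

# Requesting a float dtype makes NumPy store each None as nan
dbm_array = np.array(dbm, dtype=float)

print(dbm_array.dtype)       # float64 rather than object
print(np.isnan(dbm_array))   # True only at the two missing positions
```

Without dtype=float the same list would become an object array again, which is why the conversion in the question's code did not help.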

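So the first thing to do is check that your input arrays are well-formed. If you then run into trouble with curve fitting, you can create an array without the pesky nan values: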

import numpy as np

a = np.array([[0,np.nan], [1, 1], [2, 1.5], [3.2, np.nan], [4, 5]])
# ~ inverts the boolean mask (current NumPy rejects unary minus on boolean arrays)
b = a[~np.isnan(a[:,1])]

Let's look at the contents of a and b:

>>> a
array([[ 0. ,  nan],
       [ 1. ,  1. ],
       [ 2. ,  1.5],
       [ 3.2,  nan],
       [ 4. ,  5. ]])
>>> b
array([[ 1. ,  1. ],
       [ 2. ,  1.5],
       [ 4. ,  5. ]])

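That is what you want. Fit the curve to b, which contains no nan values. nan has a habit of propagating through calculations and turning the results into nan. (This is by design.)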

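So how does this work? np.isnan(a[:,1]) returns a boolean array that is True wherever column 1 of a holds a nan and False wherever it holds a valid number. That is exactly the opposite of what we want, so we invert it with the ~ operator (older NumPy versions also accepted a unary minus for this, but current versions raise a TypeError). Indexing with the result then selects only the rows that contain numbers.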
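
To make the steps visible, the intermediate arrays can be printed one at a time. This is only a sketch of the same filtering on the same sample array; the names is_missing and keep are illustrative and do not come from the answer:

```python
import numpy as np

a = np.array([[0, np.nan], [1, 1], [2, 1.5], [3.2, np.nan], [4, 5]])

is_missing = np.isnan(a[:, 1])   # True where column 1 is nan
keep = ~is_missing               # inverted: True where column 1 holds a number
b = a[keep]                      # boolean indexing keeps only the True rows

print(is_missing)   # rows 0 and 3 are flagged as missing
print(keep)         # rows 1, 2 and 4 are kept
print(b.shape)      # three rows, two columns
```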

If your X data and Y data are in two separate 1-D vectors, you can do this:

import numpy as np
import matplotlib.pyplot as plt

# original y data: Y
# original x data: X
# both are NumPy float arrays of the same length

# calculate a mask to be used (a boolean vector); ~ inverts the isnan result
msk = ~np.isnan(Y)

# use the mask to plot both X and Y only at the points where Y is not NaN
plt.plot(X[msk], Y[msk])
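
The same mask can feed np.polyfit to get the trendline the question asks for. The sketch below is one possible way to do it, not the answerer's own code: fit_trendline and plot_with_trendline are hypothetical helper names, the sample data is made up, and the x-axis limits are taken from the full X range so the plot keeps its size even where Y is missing.

```python
import numpy as np

def fit_trendline(X, Y, order=1):
    """Return a polynomial fitted to the points where X and Y are both numbers."""
    X = np.asarray(X, dtype=float)   # None entries become nan here
    Y = np.asarray(Y, dtype=float)
    msk = ~(np.isnan(X) | np.isnan(Y))
    return np.poly1d(np.polyfit(X[msk], Y[msk], order))

def plot_with_trendline(X, Y, order=1):
    """Scatter the valid points and draw the trendline across the full X range."""
    import matplotlib.pyplot as plt  # imported lazily so the fit works without matplotlib
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    msk = ~(np.isnan(X) | np.isnan(Y))
    trend = fit_trendline(X, Y, order)
    xs = np.linspace(np.nanmin(X), np.nanmax(X), 100)
    fig, ax = plt.subplots()
    ax.scatter(X[msk], Y[msk])
    ax.plot(xs, trend(xs), '--k')
    ax.set_xlim(np.nanmin(X), np.nanmax(X))  # axis still spans the gaps
    return fig, ax

# Made-up sample data: the three valid points lie on y = 2x - 1
X = [0, 1, 2, 3, 4]
Y = [None, 1, 3, None, 7]
trend = fit_trendline(X, Y)
print(trend.coeffs)   # slope and intercept of the fitted line
```

Because the fit returns a polynomial object, the trendline can be evaluated at any x, including the positions where the original Y value was missing.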

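In some cases you may not have any X data at all, but want to number the points starting from 0 (as matplotlib does when it is given only one vector). There are several ways to do this; here is one: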

msk = ~np.isnan(Y)
X = np.arange(len(Y))
plt.plot(X[msk], Y[msk])
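
Since the question mentions masked_invalid, here is an alternative sketch that uses NumPy's masked-array module instead of boolean indexing. It is a swapped-in technique rather than what the answer above does, and it only works once the data is a float array with nan, not an object array with None. The sample values are made up.

```python
import numpy as np
import numpy.ma as ma

# Made-up float data with nan at the gaps; the valid points lie on y = 2x - 1
Y = np.array([np.nan, 1.0, 3.0, np.nan, 7.0])
X = np.arange(len(Y))            # number the points from 0

Ym = ma.masked_invalid(Y)        # masks the nan entries, keeps the array length
coeffs = ma.polyfit(X, Ym, 1)    # masked points are left out of the fit

print(Ym.mask)    # True at the two masked positions
print(coeffs)     # slope and intercept of the fitted line
```

The masked array keeps its original length, which may suit the requirement of not dropping elements, while the fit itself still uses only the valid points.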

Answered 2025-04-18 by Python大师
