如何在Python中创建时间序列数据的线性回归预测?

3条回答

网友

1楼 · 编辑于 2024-05-16 04:00:57

在你的问题中，你清楚地表明你希望回归输出以及输出的上下限预测。你还提到使用Holt-Winters算法特别是预测。

其他应答者建议的包是有用的，但是您可能会注意到该sklearnLinearRegression没有给您“超出 “statsmodels”这个框可以not provide Holt-Winters right now。

因此，我建议尝试使用霍尔特温特斯的this implementation。不幸的是它的许可证不清楚，所以我不能在这里复制已满。现在，我不确定你是否真的想要霍尔特·温特斯（季节性）预测，或霍尔特线性指数平滑算法。我猜是后者给了这个职位的头衔。因此，您可以使用链接库的linear()函数。这个对于感兴趣的读者来说，技巧是described in detail here。

为了不提供只提供链接的答案，我将描述这里的主要特点。定义了一个函数来获取数据，即

 def linear(x, fc, alpha = None, beta = None):

x是要匹配的数据，fc是所需的时间步数为了预测，alpha和beta采用了通常的Holt-Winters含义：大致是一个参数，用于将平滑度控制到“级别” 并分别走向“潮流”。如果alpha或beta不是指定时，使用^{}估计它们最大限度地降低RMSE。

函数通过循环遍历现有数据点，然后按如下方式返回预测：

 return Y[-fc:], alpha, beta, rmse

其中Y[-fc:]是预测点，alpha和beta是实际使用的值，rmse是均方根误差。不幸的是，正如你所见，没有上下的信心间隔。顺便说一下，我们应该把它们称为prediction intervals。

预测区间数学

Holt算法和Holt-Winters算法是指数平滑的生成预测的技术和确定置信区间对他们来说是个棘手的问题。它们被称为"rule of thumb"方法，在Holt-Winters乘法的情况下算法，不带"underlying statistical model"。然而 final footnote to this page断言：

It is possible to calculate confidence intervals around long-term forecasts produced by exponential smoothing models, by considering them as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear); (iii) the value(s) of the smoothing constant(s); and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model and they spread out much faster when linear rather than simple smoothing is used.

我们看到hereARIMA（0,2,2）模型等价于Holt 加性误差线性模型

预测间隔代码（即如何进行）

你在评论中指出你"can easily do this in R"。我您可能已经习惯了 forecast包在R中，因此需要这样的间隔。在哪种情况-您可以调整python库，以便在同样的基础。

查看R ^{} code，我们可以看到它返回一个对象基于forecast(ets(...)。在引擎盖下-这最终要求 this function ^{}，返回平均值mu和方差 var（以及cj我不得不承认我不明白）。方差用于计算上下界here。

要在Python中做类似的事情，我们需要类似于class1R函数，它估计每个预测。此函数将模型拟合和在每个时间步将它们乘以一个因子，得到那个时间步。在线性Holt算法的特殊情况下，因子是alpha + k*beta的累积和其中k是时间步的预测数。一旦你有了它每个预测点的方差，将误差视为正常值从正态分布中得到X%的值。

下面是一个如何在Python中实现这一点的想法（使用我链接为你的线性函数）

#Copy, import or reimplement the RMSE and linear function from
#https://gist.github.com/andrequeiroz/5888967

#factor in case there are not 1 timestep per day - in your case
#assuming the timesteps are UTC epoch - I think they're 5 min
# spaced i.e. 288 per day
timesteps_per_day = 288

# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day.  From the timestamps this seems valid.
# if you can't guarantee that - you'll need to interpolate the data
def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):
    forecast_timesteps = forecast_days*timesteps_per_day
    middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))
    cum_error = [beta+alpha]
    for k in range(1,forecast_timesteps):
        cum_error.append(cum_error[k-1] + k*beta + alpha)

    cum_error = np.array(cum_error)
    #Use some numpy multiplication to get the intervals
    var = cum_error * rmse**2
    # find the correct ppf on the normal distribution (two-sided)
    p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
    interval = np.sqrt(var) * p
    upper = middle_predictions + interval
    lower = middle_predictions - interval
    fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]

    ret_value = []

    ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})
    ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})
    ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})
    return ret_value

if __name__=='__main__':
    import numpy as np
    import scipy.stats
    from math import sqrt

    null = None
    data_in = [{"target": "average", "datapoints": [[null, 1435688679],
    [34.870499801635745, 1435688694], [null, 1435688709], [null,
    1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
    [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
    1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
    [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
    1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
    [38.180000209808348, 1435688994], [null, 1435689009], [null,
    1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
    [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
    1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
    [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
    1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
    [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
    [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
    1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
    [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
    1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
    [null, 1435689549], [null, 1435689564]]}]

    #translate the data.  There may be better ways if you're
    #prepared to use pandas / input data is proper json
    time_series = data_in[0]["datapoints"]

    epoch_in = []
    Y_observed = []

    for (y,x) in time_series:
        if y and x:
            epoch_in.append(x)
            Y_observed.append(y)

    #Pass in the number of days to forecast
    fcast_days = 30
    res = holt_predict(Y_observed,epoch_in,fcast_days)
    data_out = data_in + res
    #data_out now holds the data as you wanted it.

    #Optional plot of results
    import matplotlib.pyplot as plt
    plt.plot(epoch_in,Y_observed)
    m,tstamps = zip(*res[0]['datapoints'])
    u,tstamps = zip(*res[1]['datapoints'])
    l,tstamps = zip(*res[2]['datapoints'])
    plt.plot(tstamps,u, label='upper')
    plt.plot(tstamps,l, label='lower')
    plt.plot(tstamps,m, label='mean')
    plt.show()

N.B.我给出的输出将点作为tuple类型添加到对象中。如果您真的需要list，那么将zip(upper,fcast_timestamps)替换为map(list,zip(upper,fcast_timestamps))，其中代码向结果添加upper、lower和Forecast指令。

此代码用于霍尔特线性算法的特殊情况-它不是计算正确预测间隔的通用方法。

重要提示

您的示例输入数据似乎有很多null，并且只有3个真正的数据点。这不太可能是一个很好的基础时间序列预测-尤其是他们似乎都是15分钟，你试图预测多达3个月！。如果你把数据输入 holt()，它会说：

You've got to be joking. I need more data!

我假设你有一个更大的数据集要测试。我在2015年的股市开盘价上尝试了上面的代码，它似乎给出了合理的结果（见下文）。

你可能认为预测间隔看起来有点宽。This blog from the author of the R forecast module意味着这是有意的，尽管：

"confidence intervals for the mean are much narrower than prediction intervals"

网友
2楼 · 编辑于 2024-05-16 04:00:57

注意：这不是一个详细的规范答案，而是对适用于您的域的可用库（统计模型）的引用。
对于python，可以使用：
剪刀：^{}
e、 g.http://glowingpython.blogspot.com/2012/03/linear-regression-with-numpy.html（使用scipy.stats.linregress）
statsmodels：这里是some examples。
sklearn：这是documentation。
有一些好文章：
http://davidcoallier.com/blog/linear-regression-from-r-to-python/，特别是它有一个关于预测的部分。
http://connor-johnson.com/2014/02/18/linear-regression-with-python/，另一篇文章，展示了如何使用statsmodels和sklearn。

网友
3楼 · 编辑于 2024-05-16 04:00:57

Scikit是python的一个很好的模块

X和Y变量必须分成两个数组，如果要绘制它们（X，Y），其中一个数组的索引将与另一个数组匹配。

所以我猜在你的时间序列数据中，你会把X和Y值分开如下：

null = None
time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]

 # assuming the time series is the X axis value

input_X_vars = []
input_Y_vars = []

for pair in time_series[0]["datapoints"]:
    if (pair[0] != None and pair[1] != None):
        input_X_vars.append(pair[1])
        input_Y_vars.append(pair[0])



import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

regr = linear_model.LinearRegression()
regr.fit(input_X_vars, input_Y_vars)

test_X_vars = [1435688681, 1435688685]

results = regr.predict(test_X_vars)

forecast_append = {"target": "Lower", "datapoints": results}

time_series.append(forecast_append)

我将null设置为None，因为在python中，“null”关键字被解析为一个简单的变量。

预测区间数学

预测间隔代码（即如何进行）

重要提示

相关问题更多 >

编程相关推荐

热门问题

热门文章