如何在Python中创建时间序列数据的线性回归预测?

2024-05-16 04:00:57 发布

您现在位置:Python中文网/ 问答频道 /正文

我需要能够创建一个python函数,用于基于线性回归模型的预测,该模型对时间序列数据具有置信区间:

函数需要一个参数来指定要预测的距离。例如1天、7天、30天、90天等。根据不同的论证,需要创建带有置信区间的霍尔特温特斯预测:

我的时间序列数据如下:

print series

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]

函数应将预测值附加到上述称为“序列”和返回序列的时间序列数据中:

[{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]},{"target": "Forecast", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Upper", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]},{"target": "Lower", "datapoints": [[186.77999925613403, 1435520801], [178.95000147819519, 1435521131]]}]

有人用python做过这样的事情吗?有什么办法开始吗?


Tags: 数据函数模型距离target参数时间序列
3条回答

在你的问题中,你清楚地表明你希望 回归输出以及输出的上下限 预测。你还提到使用Holt-Winters算法 特别是预测。

其他应答者建议的包是有用的,但是您可能会注意到 该sklearnLinearRegression没有给您“超出 “statsmodels”这个框可以not provide Holt-Winters right now

因此,我建议尝试使用霍尔特温特斯的this implementation。 不幸的是它的许可证不清楚,所以我不能在这里复制 已满。现在,我不确定你是否真的想要霍尔特·温特斯 (季节性)预测,或霍尔特线性指数平滑 算法。我猜是后者给了这个职位的头衔。因此, 您可以使用链接库的linear()函数。这个 对于感兴趣的读者来说,技巧是described in detail here

为了不提供只提供链接的答案,我将描述 这里的主要特点。定义了一个函数来获取数据,即

 def linear(x, fc, alpha = None, beta = None):

x是要匹配的数据,fc是所需的时间步数 为了预测,alpha和beta采用了通常的Holt-Winters含义: 大致是一个参数,用于将平滑度控制到“级别” 并分别走向“潮流”。如果alphabeta不是 指定时,使用^{}估计它们 最大限度地降低RMSE。

函数通过循环遍历 现有数据点,然后按如下方式返回预测:

 return Y[-fc:], alpha, beta, rmse

其中Y[-fc:]是预测点,alphabeta是 实际使用的值,rmse是均方根误差。 不幸的是,正如你所见,没有上下的信心 间隔。顺便说一下,我们应该把它们称为prediction intervals

预测区间数学

Holt算法和Holt-Winters算法是指数平滑的 生成预测的技术和确定置信区间 对他们来说是个棘手的问题。它们被称为"rule of thumb"方法,在Holt-Winters乘法的情况下 算法,不带"underlying statistical model"。然而 final footnote to this page断言:

It is possible to calculate confidence intervals around long-term forecasts produced by exponential smoothing models, by considering them as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear); (iii) the value(s) of the smoothing constant(s); and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model and they spread out much faster when linear rather than simple smoothing is used.

我们看到hereARIMA(0,2,2)模型等价于Holt 加性误差线性模型

预测间隔代码(即如何进行)

你在评论中指出你"can easily do this in R"。我 您可能已经习惯了 forecast包在R中,因此需要这样的间隔。在 哪种情况-您可以调整python库,以便在 同样的基础。

查看R ^{} code,我们可以看到它返回一个对象 基于forecast(ets(...)。在引擎盖下-这最终要求 this function ^{},返回平均值mu和方差 var(以及cj我不得不承认我不明白)。 方差用于计算上下界here

要在Python中做类似的事情,我们需要 类似于class1R函数,它估计每个 预测。此函数将模型拟合和 在每个时间步将它们乘以一个因子,得到 那个时间步。在线性Holt算法的特殊情况下,因子是alpha + k*beta的累积和 其中k是时间步的预测数。一旦你有了它 每个预测点的方差,将误差视为正常值 从正态分布中得到X%的值。

下面是一个如何在Python中实现这一点的想法(使用我链接为 你的线性函数)

#Copy, import or reimplement the RMSE and linear function from
#https://gist.github.com/andrequeiroz/5888967

#factor in case there are not 1 timestep per day - in your case
#assuming the timesteps are UTC epoch - I think they're 5 min
# spaced i.e. 288 per day
timesteps_per_day = 288

# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day.  From the timestamps this seems valid.
# if you can't guarantee that - you'll need to interpolate the data
def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):
    forecast_timesteps = forecast_days*timesteps_per_day
    middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))
    cum_error = [beta+alpha]
    for k in range(1,forecast_timesteps):
        cum_error.append(cum_error[k-1] + k*beta + alpha)

    cum_error = np.array(cum_error)
    #Use some numpy multiplication to get the intervals
    var = cum_error * rmse**2
    # find the correct ppf on the normal distribution (two-sided)
    p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
    interval = np.sqrt(var) * p
    upper = middle_predictions + interval
    lower = middle_predictions - interval
    fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]

    ret_value = []

    ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})
    ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})
    ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})
    return ret_value

if __name__=='__main__':
    import numpy as np
    import scipy.stats
    from math import sqrt

    null = None
    data_in = [{"target": "average", "datapoints": [[null, 1435688679],
    [34.870499801635745, 1435688694], [null, 1435688709], [null,
    1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
    [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
    1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
    [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
    1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
    [38.180000209808348, 1435688994], [null, 1435689009], [null,
    1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
    [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
    1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
    [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
    1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
    [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
    [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
    1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
    [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
    1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
    [null, 1435689549], [null, 1435689564]]}]

    #translate the data.  There may be better ways if you're
    #prepared to use pandas / input data is proper json
    time_series = data_in[0]["datapoints"]

    epoch_in = []
    Y_observed = []

    for (y,x) in time_series:
        if y and x:
            epoch_in.append(x)
            Y_observed.append(y)

    #Pass in the number of days to forecast
    fcast_days = 30
    res = holt_predict(Y_observed,epoch_in,fcast_days)
    data_out = data_in + res
    #data_out now holds the data as you wanted it.

    #Optional plot of results
    import matplotlib.pyplot as plt
    plt.plot(epoch_in,Y_observed)
    m,tstamps = zip(*res[0]['datapoints'])
    u,tstamps = zip(*res[1]['datapoints'])
    l,tstamps = zip(*res[2]['datapoints'])
    plt.plot(tstamps,u, label='upper')
    plt.plot(tstamps,l, label='lower')
    plt.plot(tstamps,m, label='mean')
    plt.show()

N.B.我给出的输出将点作为tuple类型添加到对象中。如果您真的需要list,那么将zip(upper,fcast_timestamps)替换为map(list,zip(upper,fcast_timestamps)),其中代码向结果添加upperlowerForecast指令。

此代码用于霍尔特线性算法的特殊情况-它不是计算正确预测间隔的通用方法。

重要提示

您的示例输入数据似乎有很多null,并且只有3个真正的 数据点。这不太可能是一个很好的基础 时间序列预测-尤其是他们似乎都是15分钟,你试图预测多达3个月!。如果你把数据输入 holt(),它会说:

You've got to be joking. I need more data!

我假设你有一个更大的数据集要测试。我在2015年的股市开盘价上尝试了上面的代码,它似乎给出了合理的结果(见下文)。

Code used on 2015 stock market opening prices

你可能认为预测间隔看起来有点宽。This blog from the author of the R forecast module意味着这是有意的,尽管:

"con­fi­dence inter­vals for the mean are much nar­rower than pre­dic­tion inter­vals"

注意:这不是一个详细的规范答案,而是对适用于您的域的可用库(统计模型)的引用。

对于python,可以使用:

有一些好文章:

Scikit是python的一个很好的模块

X和Y变量必须分成两个数组,如果要绘制它们(X,Y),其中一个数组的索引将与另一个数组匹配。

所以我猜在你的时间序列数据中,你会把X和Y值分开如下:

null = None
time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]

 # assuming the time series is the X axis value

input_X_vars = []
input_Y_vars = []

for pair in time_series[0]["datapoints"]:
    if (pair[0] != None and pair[1] != None):
        input_X_vars.append(pair[1])
        input_Y_vars.append(pair[0])



import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

regr = linear_model.LinearRegression()
regr.fit(input_X_vars, input_Y_vars)

test_X_vars = [1435688681, 1435688685]

results = regr.predict(test_X_vars)

forecast_append = {"target": "Lower", "datapoints": results}

time_series.append(forecast_append)

我将null设置为None,因为在python中,“null”关键字被解析为一个简单的变量。

相关问题 更多 >