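<p>On time-series datasets, the data has to be split differently, because the test data must always come after the training data in time. <a href="http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/" rel="noreferrer">See this link</a> for more information. Alternatively, you can try <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html" rel="noreferrer">TimeSeriesSplit</a> from the scikit-learn package. The main idea: suppose you have 10 data points ordered by timestamp. The splits would then look like this:</p>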
<pre><code>Split 1 :
Train_indices : 1
Test_indices : 2
Split 2 :
Train_indices : 1, 2
Test_indices : 3
Split 3 :
Train_indices : 1, 2, 3
Test_indices : 4
Split 4 :
Train_indices : 1, 2, 3, 4
Test_indices : 5
</code></pre>
<p>And so on. See the example at the link above to get a better idea of how TimeSeriesSplit works in sklearn.</p>
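<p>As a minimal sketch, the expanding-window listing above can be reproduced with TimeSeriesSplit. It assumes 10 placeholder samples already ordered by time and uses n_splits=9 so that each test fold holds a single sample; the variable names are illustrative only, and sklearn indices are 0-based, so 1 is added when printing.</p>

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten placeholder observations, assumed to be already ordered by timestamp.
X = np.arange(10).reshape(-1, 1)

# With 10 samples and n_splits=9, every test fold holds exactly one sample.
tscv = TimeSeriesSplit(n_splits=9)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for i, (train_idx, test_idx) in enumerate(splits[:4], start=1):
    # sklearn indices are 0-based; add 1 to match the listing above.
    train_1based = [j + 1 for j in train_idx]
    test_1based = [j + 1 for j in test_idx]
    print(f"Split {i}: train={train_1based} test={test_1based}")
```

<p>With fewer splits, the same 10 points would be divided into larger test folds, but the training window would still only ever grow forward in time.</p>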
<p><strong>Update</strong>
If you have a separate time column, you can simply sort the data by that column and then apply TimeSeriesSplit as described above to get the splits.</p>
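<p>A rough sketch of that sort-then-split step with pandas is shown here. The DataFrame, its "timestamp" and "value" columns, and the choice of n_splits=2 are all hypothetical and only serve to illustrate the idea.</p>

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: a "timestamp" column in shuffled order plus one feature.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-04", "2020-01-01", "2020-01-06",
         "2020-01-02", "2020-01-05", "2020-01-03"]
    ),
    "value": [40, 10, 60, 20, 50, 30],
})

# Sort chronologically and reset the index so row positions follow time order.
df_sorted = df.sort_values("timestamp").reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=2)
folds = []
for train_idx, test_idx in tscv.split(df_sorted):
    # The indices are positional, so use .iloc to select rows.
    train, test = df_sorted.iloc[train_idx], df_sorted.iloc[test_idx]
    folds.append((train, test))
    print("train until", train["timestamp"].max().date(),
          "| test from", test["timestamp"].min().date())
```

<p>Resetting the index after sorting is optional here, since the split works on positions, but it keeps the row labels in line with the time order.</p>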
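<p>The number of splits controls the train/test proportions. The formula below is meant to give 67% training and 33% test data; note that in the 12-point example that follows it produces 3 splits with 3-sample test folds, so the 67%/33% ratio appears in the second split (6 train, 3 test), while the final split has 9 training and 3 test points (75%/25%):</p>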
<pre><code>no_of_split = int((len(data)-3)/3)
</code></pre>
<p>Example</p>
<pre><code>import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
              [3, 4], [1, 2], [3, 4], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
tscv = TimeSeriesSplit(n_splits=int((len(y) - 3) / 3))
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Use the indices to select the training and test data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
</code></pre>
<p>Output:</p>
<pre><code>TRAIN: [0 1 2] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
</code></pre>
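<p>To see how the number of splits affects the proportions of the final split, the sizes can be checked directly. The sketch below uses 12 placeholder samples and a hypothetical helper, final_split_sizes, that is not part of sklearn; it compares n_splits=3 (the value the formula gives for 12 samples) with n_splits=2.</p>

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((12, 2))  # 12 placeholder samples; the values are irrelevant here


def final_split_sizes(n_splits):
    """Hypothetical helper: (n_train, n_test) of the last split produced on X."""
    train_idx, test_idx = list(TimeSeriesSplit(n_splits=n_splits).split(X))[-1]
    return len(train_idx), len(test_idx)


for n_splits in (3, 2):
    n_train, n_test = final_split_sizes(n_splits)
    share = n_train / (n_train + n_test)
    print(f"n_splits={n_splits}: train={n_train}, test={n_test}, "
          f"train share={share:.0%}")
```

<p>On these 12 samples, n_splits=3 leaves 9 training and 3 test points in the last split (75%/25%), while n_splits=2 leaves 8 and 4 (about 67%/33%).</p>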