How do I split a dataset into training and test sets, e.g. for cross-validation?

User

#1 · edited 2024-04-26 04:27:22

Just a note. If you need training, test, and validation sets, you can do the following:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

With these parameters, 70% of the data goes to training and 15% each to the test and validation sets. Hope this helps.
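
To check the proportions, here is a minimal sketch of the same two-step split on synthetic data. The array shapes, the seed, and the intermediate names `x_rest` / `y_rest` are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 2, size=200)

# Step 1: hold out 30% of the rows.
x_train, x_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0
)
# Step 2: cut the held-out 30% in half, giving test and validation sets.
x_test, x_val, y_test, y_val = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=0
)

print(len(x_train), len(x_test), len(x_val))
```

Fixing `random_state` makes both splits reproducible; drop it if you want a different partition on every run.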

User

#2 · edited 2024-04-26 04:27:22

Another option is to use scikit-learn. As scikit's wiki describes, you only need the following:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

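This way the labels stay in sync with the data you are splitting into training and test sets.

User

#3 · edited 2024-04-26 04:27:22

If you want to split the dataset once into two parts, you can use numpy.random.shuffle; if you need to keep track of the indices, use numpy.random.permutation: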

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
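
Keeping the indices is mainly useful when labels have to be split the same way as the features. A minimal sketch, assuming a hypothetical label array `y` (here simply the row number, so the alignment is easy to verify) and NumPy's `default_rng` generator in place of the legacy `numpy.random` functions:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random((100, 5))
y = np.arange(100)  # hypothetical labels: each label is just its row number

# One permutation of the row indices drives both splits.
indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]

x_train, x_test = x[training_idx], x[test_idx]
y_train, y_test = y[training_idx], y[test_idx]
```

Because the same index arrays select rows from `x` and `y`, every feature row keeps its own label, and no row appears in both parts.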

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset with replacement:

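import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
# indices are drawn with replacement, so rows can repeat and
# the same row may land in both the training and the test sample
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]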
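
Since the training and test indices above are drawn independently, a row can end up in both samples. If that overlap is unwanted, one common variant (not what the answer itself does) is to draw only the training sample with replacement and test on the rows that were never drawn, often called the out-of-bag rows. A sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 5))
n = x.shape[0]

# Bootstrap sample: n row indices drawn with replacement.
training_idx = rng.integers(0, n, size=n)
# Out-of-bag rows: every index that the bootstrap sample never picked.
test_idx = np.setdiff1d(np.arange(n), training_idx)

training, test = x[training_idx], x[test_idx]
```

The test set size is not fixed in this variant; it depends on which rows the bootstrap draw happens to miss.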

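Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods, which create partitions of the data that are balanced with respect to some feature, for example ensuring that the training and test sets contain the same proportion of positive and negative examples.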
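
As a rough illustration of the difference, the sketch below runs `KFold` and `StratifiedKFold` from `sklearn.model_selection` on a small imbalanced toy dataset. The data, the number of folds, and the seed are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (assumption): 20 samples, 25% of them positive.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# Plain k-fold: every sample is used for testing exactly once,
# but the class balance inside a fold is left to chance.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_folds = list(kf.split(X))

# Stratified k-fold: each test fold keeps roughly the overall class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_folds = list(skf.split(X, y))

for train_idx, test_idx in stratified_folds:
    print(len(train_idx), len(test_idx), y[test_idx].mean())
```

For a single stratified split rather than k folds, `train_test_split` also accepts a `stratify=y` argument.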
