Python equivalent of R's createDataPartition

4 votes

4 answers

2427 views

Asked 2025-04-29 22:11

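I want to reproduce the behavior of R's createDataPartition function in Python. I have a machine learning dataset with a boolean target variable, and I want to split it into a training set (60%) and a test set (40%).

If I split the data completely at random, the target variable will not be distributed evenly across the two sets.

In R, I do it like this: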

inTrain <- createDataPartition(y=data$repeater, p=0.6, list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]

How can I do this in Python?

P.S. I am using scikit-learn as my machine learning library, together with pandas.


4 Answers

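As mentioned in the comments, the accepted answer does not preserve the class distribution of the data. The scikit-learn documentation says that StratifiedShuffleSplit should be used when the class distribution must be preserved. With train_test_split, you get this by passing your target array to the stratify option.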

>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split

>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)

>>> # show counts of each type after split
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50]))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([20, 20, 20]))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([30, 30, 30]))
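
For the asker's case, the same idea can be applied directly to a pandas DataFrame. The following is only a sketch: the DataFrame `data` and its `feature` column are made up for illustration, and only the `repeater` column name comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's data: 100 rows, 30% True in `repeater`.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature": rng.normal(size=100),
    "repeater": np.array([True] * 30 + [False] * 70),
})

# Stratify on the target so both parts keep the same share of True values.
training, testing = train_test_split(
    data, test_size=0.4, stratify=data["repeater"], random_state=0
)

print(len(training), len(testing))
print(training["repeater"].mean(), testing["repeater"].mean())
```

Passing the whole DataFrame keeps features and target together, which is close to what `data[inTrain,]` and `data[-inTrain,]` do in the R snippet.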

Answered 2025-04-29 by Python大师


That answer is incorrect. Apparently Python has no function that performs stratified sampling the way R's createDataPartition does; it only samples at random.

Answered 2025-04-29 by Python大师


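The correct answer is sklearn.model_selection.StratifiedShuffleSplit.

StratifiedShuffleSplit is a cross-validator.

It provides train and test indices for splitting data into a training set and a test set.

This cross-validation object combines StratifiedKFold and ShuffleSplit and returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: as with the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.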
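
As a rough sketch of how the index-based interface described above might be used for a single 60/40 split, the example below builds a small made-up DataFrame (only the `repeater` column name comes from the question; `data`, `feature`, `in_train` and `in_test` are illustrative names) and selects rows by the returned positions, much like `inTrain` in the R code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 50 rows, 20% True in the boolean target.
data = pd.DataFrame({
    "feature": np.arange(50),
    "repeater": np.array([True] * 10 + [False] * 40),
})

# n_splits=1 because only one train/test partition is wanted here.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)

# split() yields arrays of positional indices for the train and test rows.
in_train, in_test = next(splitter.split(data, data["repeater"]))

training = data.iloc[in_train]
testing = data.iloc[in_test]

print(len(training), len(testing))
print(training["repeater"].sum(), testing["repeater"].sum())
```

If several resampled partitions are needed, raising `n_splits` and looping over `splitter.split(...)` gives one index pair per iteration.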

Answered 2025-04-29 by Python大师


In scikit-learn you can use the train_test_split function.

# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed.
from sklearn.model_selection import train_test_split
from sklearn import datasets

# `table` is assumed to be an existing pandas DataFrame with these columns.
# Use Age and Weight to predict a value for the food someone chooses
X_train, X_test, y_train, y_test = train_test_split(table[['Age', 'Weight']],
                                                    table['Food Choice'],
                                                    test_size=0.25)

# Another example using the sklearn pre-loaded datasets:
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)

This function splits the data into:

inputs for training
inputs for evaluation
outputs for training
outputs for evaluation

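You can also pass an argument such as test_size=0.25 to change the proportion of data used for training and testing.

To split a single dataset and hold out 40% of it as test data, you can do this: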

>>> import numpy as np
>>> data = np.arange(700).reshape((100, 7))
>>> training, testing = train_test_split(data, test_size=0.4)
>>> print(len(data))
100
>>> print(len(training))
60
>>> print(len(testing))
40
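
Note that the calls above split purely at random. To see what that means for an imbalanced boolean target like the one in the question, the sketch below compares the share of True labels in the test set across several seeds, with and without the `stratify` argument. The arrays and the helper `positive_rates_in_test` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced boolean target: 200 samples, 10% True.
y = np.array([True] * 20 + [False] * 180)
X = np.arange(len(y)).reshape(-1, 1)

def positive_rates_in_test(stratify):
    """Return the share of True labels in the test set for 20 random seeds."""
    rates = []
    for seed in range(20):
        _, _, _, y_test = train_test_split(
            X, y,
            test_size=0.4,
            stratify=y if stratify else None,
            random_state=seed,
        )
        rates.append(y_test.mean())
    return np.array(rates)

plain = positive_rates_in_test(stratify=False)
stratified = positive_rates_in_test(stratify=True)

print("plain:      min=%.3f max=%.3f" % (plain.min(), plain.max()))
print("stratified: min=%.3f max=%.3f" % (stratified.min(), stratified.max()))
```

With stratification the test-set share stays at the overall rate for every seed, while the plain split is free to drift from it.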

Answered 2025-04-29 by Python大师
