scikit-learn: train/test split based on a group variable

Unique ID   Exp start date   Value   Status
001         01/01/2020       4000    Closed
001         12/01/2019       4000    Archived
002         01/01/2020       5000    Closed
002         12/01/2019       5000    Archived

2 answers

User

Answer 1 · edited 2024-06-08 07:34:13

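train_test_split has a stratify parameter. The idea here is to set it to the 'Unique ID' column so that an ID cannot appear in both the training and test sets. Be aware, though, that stratify preserves the proportion of each value in both sets, so rows sharing an ID get spread across train and test rather than kept together; with only two rows per ID and a small test_size, the call may also fail outright.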

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=df['Unique ID'].values)

User

Answer 2 · edited 2024-06-08 07:34:13

I believe you need GroupShuffleSplit (see its entry in the scikit-learn documentation).

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
print(groups.shape)

gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)

Output:

(8,)
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]

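As you can see above, the train/test indices are created from the groups variable: every row of a given group lands entirely in the training set or entirely in the test set.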
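If you want to confirm this rather than read it off the printed indices, a minimal sketch is to intersect the group labels on each side of every split. It reuses the toy arrays from the example above; the `overlaps` list is just a name chosen here for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Same toy data as in the example above
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])

gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

overlaps = []  # illustrative name: group labels found on both sides of a split
for train_idx, test_idx in gss.split(X, y, groups):
    # Labels present in both the training rows and the test rows
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("groups in both train and test:", shared)
```

An empty set for every split means no group is shared between the two sides.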

In your case, the Unique ID column should be used as the groups.
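As a rough sketch of what that could look like on the table from the question: the DataFrame below is rebuilt by hand from the four rows shown there, and `test_size=0.5` and `random_state=0` are arbitrary choices for illustration (with only two IDs, one ID goes to each side). Which columns serve as features and target is not stated in the question, so the sketch only splits the rows.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Rows copied from the question's table
df = pd.DataFrame({
    "Unique ID": ["001", "001", "002", "002"],
    "Exp start date": ["01/01/2020", "12/01/2019", "01/01/2020", "12/01/2019"],
    "Value": [4000, 4000, 5000, 5000],
    "Status": ["Closed", "Archived", "Closed", "Archived"],
})

# Assumed settings: a single split, half of the IDs held out for testing
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["Unique ID"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Each ID should show up on exactly one side
train_ids = set(train_df["Unique ID"])
test_ids = set(test_df["Unique ID"])
print("train IDs:", train_ids, "test IDs:", test_ids)
```

Both rows of an ID (the Closed one and the Archived one) travel together, which is the behaviour the question asks for.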
