sklearn.cross_validation.StratifiedShuffleSplit - error: "indices are out-of-bounds"

Posted 2024-06-01 00:08:20

I'm trying to split a sample dataset using Scikit-learn's stratified shuffle split. I followed the example shown in the Scikit-learn documentation.

import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)

# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)

for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]

# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()

However, when I run this script I get the following error:

IndexError: indices are out-of-bounds

Can anyone point out what I'm doing wrong here? Thanks!


1 answer
User
#1 · posted 2024-06-01 00:08:20

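You're running into the different conventions for Pandas indexing versus NumPy indexing. The arrays train_index and test_index are collections of row positions. But data is a Pandas DataFrame, and when you index it with plain brackets, as in data[train_index], Pandas expects train_index to contain column labels rather than row positions. You can convert the DataFrame to a NumPy array using .values: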

data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]

Or use the Pandas .iloc accessor:

for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]

I tend to prefer the second approach, since it gives xtrain and xtest as DataFrame objects rather than ndarray, and so keeps the column labels.
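For readers on a current scikit-learn release: the sklearn.cross_validation module was deprecated in 0.18 and removed in 0.20, so the snippets above only run on old versions. Below is a minimal sketch of the same .iloc approach using sklearn.model_selection.StratifiedShuffleSplit. The synthetic data, column names, and random_state are assumptions made for illustration and stand in for the wine dataset; target is also indexed with .iloc here so the lookup stays positional even if the Series does not have a default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the wine data (hypothetical columns and values).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=100),
    "pH": rng.normal(3.2, 0.1, size=100),
})
target = pd.Series(np.repeat([5, 6], 50), name="quality")

# Bracket indexing on a DataFrame looks up column labels, so integer
# row positions fail here (the exception type depends on the pandas version).
try:
    data[np.array([0, 1, 2])]
    bracket_lookup_failed = False
except (KeyError, IndexError):
    bracket_lookup_failed = True

# In sklearn.model_selection the splitter takes n_splits, and the data
# is passed to split() instead of the constructor.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(data, target):
    # .iloc selects by position for both the DataFrame and the Series.
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]

# Check the class distribution in each part of the last split.
print(ytrain.value_counts())
print(ytest.value_counts())
```

As in the original script, the variables hold only the last of the three splits once the loop finishes; collect them in a list inside the loop if every split is needed.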
