Stratified split of a Pandas DataFrame into training, validation, and test sets

User
Reply #1 · edited 2024-05-13 03:06:34

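Below is a Python function that splits a Pandas DataFrame into train, validation, and test DataFrames using stratified sampling. It performs the split by calling scikit-learn's train_test_split() twice.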
import math

import pandas as pd
from sklearn.model_selection import train_test_split


def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

    # Compare with a tolerance: some sums of fractions are not exactly 1.0 in floating point.
    if not math.isclose(frac_train + frac_val + frac_test, 1.0):
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input  # Contains all columns.
    y = df_input[[stratify_colname]]  # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test
Below is a complete working example.
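Consider a dataset with a label on which you want to stratify. This label has its own distribution in the original dataset, say 75% foo, 15% bar, and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same label distribution.

[Figure: the dataset split 60/20/20 into train, validation, and test subsets, each keeping the same label distribution]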
Here is the example dataset:
df = pd.DataFrame(
    {
        'A': list(range(0, 100)),
        'B': list(range(100, 0, -1)),
        'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10
    }
)

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10
# Name: label, dtype: int64
Now let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test DataFrames at a 60/20/20 ratio.
df_train, df_val, df_test = \
    split_stratified_into_train_val_test(df, stratify_colname='label',
                                         frac_train=0.60, frac_val=0.20, frac_test=0.20)
Together, the three DataFrames df_train, df_val, and df_test contain all of the original rows, and their sizes follow the ratio above.
df_train.shape
# (60, 3)

df_val.shape
# (20, 3)

df_test.shape
# (20, 3)
Furthermore, each of the three splits has the same label distribution, namely 75% foo, 15% bar, and 10% baz.
df_train.label.value_counts()
# foo    45
# bar     9
# baz     6
# Name: label, dtype: int64

df_val.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

df_test.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64
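The arithmetic behind the two-stage split can be checked without scikit-learn. The sketch below is only an illustration: the helper names two_stage_fractions and expected_counts are hypothetical and not part of the answer above. It shows why the second call needs a rescaled test_size, and what per-label counts an ideal stratified 60/20/20 split of the 75/15/10 example would give.

```python
import math


def two_stage_fractions(frac_train, frac_val, frac_test):
    """Return the test_size values for two successive train_test_split() calls.

    Hypothetical helper for illustration only.
    """
    if not math.isclose(frac_train + frac_val + frac_test, 1.0):
        raise ValueError("fractions must sum to 1.0")
    # Stage 1 holds back everything that is not training data.
    first_test_size = 1.0 - frac_train
    # Stage 2 splits only the held-back rows, so the test share is
    # expressed relative to (val + test), not to the whole dataset.
    second_test_size = frac_test / (frac_val + frac_test)
    return first_test_size, second_test_size


def expected_counts(label_counts, frac):
    """Ideal per-label row counts when a split keeps `frac` of every label."""
    return {label: n * frac for label, n in label_counts.items()}


first, second = two_stage_fractions(0.60, 0.20, 0.20)
counts = {"foo": 75, "bar": 15, "baz": 10}

print("stage 1 test_size:", first)
print("stage 2 test_size:", second)
print("train:", expected_counts(counts, 0.60))
print("val/test:", expected_counts(counts, 0.20))
```

With real data the per-label counts are not always whole numbers, so the actual splits can differ from these ideal values by a row or so per label.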

User
Reply #2 · edited 2024-05-13 03:06:34

np.array_split
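If you want to generalize to n splits, np.array_split is your friend (it works well with DataFrames). Note that this split is random but not stratified.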
import numpy as np

fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input
df = df.sample(frac=1)
# split into 3 parts at the cumulative row positions
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))
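To make the cut-point logic explicit, here is a minimal sketch of the same idea written with iloc slices instead of np.array_split. The function name split_by_fractions is hypothetical, and rounding the cut points before converting to int is my own addition to guard against a product such as 59.999... being truncated to 59. I believe some recent pandas releases also emit a deprecation warning when np.array_split is applied directly to a DataFrame, which plain slicing avoids, but treat that as an assumption to verify against your installed version.

```python
import numpy as np
import pandas as pd


def split_by_fractions(df, fractions, random_state=None):
    """Shuffle df and cut it into len(fractions) consecutive pieces (not stratified).

    Hypothetical helper for illustration only.
    """
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1.0")
    shuffled = df.sample(frac=1, random_state=random_state)
    # Cumulative fractions give the row positions where each piece ends;
    # the last piece simply takes whatever rows remain.
    cuts = np.round(fractions[:-1].cumsum() * len(shuffled)).astype(int)
    bounds = [0, *cuts, len(shuffled)]
    return [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


df = pd.DataFrame({"A": range(100)})
train, val, test = split_by_fractions(df, [0.6, 0.2, 0.2], random_state=0)
print(len(train), len(val), len(test))
```

The same function accepts any number of fractions, for example [0.5, 0.2, 0.2, 0.1] for a four-way split.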
train_test_split
A more long-winded solution that uses train_test_split for a stratified split:
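from sklearn.model_selection import train_test_split

# pop() removes the label column from df in place and returns it
y = df.pop('diagnosis').to_frame()
X = df

# first split: 60% train, 40% held back
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.4)

# second split: halve the held-back 40% into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, stratify=y_test, test_size=0.5)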
Here X is a DataFrame of features and y is a single-column DataFrame of labels.

User
Reply #3 · edited 2024-05-13 03:06:34

Pure `pandas` solution

Split into train/validation/test at a 70/20/10% ratio (random, not stratified):

random_seed = 42  # any fixed integer works; set here only so the snippet is reproducible

train_df = df.sample(frac=0.7, random_state=random_seed)
tmp_df = df.drop(train_df.index)
test_df = tmp_df.sample(frac=0.33333, random_state=random_seed)
valid_df = tmp_df.drop(test_df.index)

assert len(df) == len(train_df) + len(valid_df) + len(test_df), "Dataset sizes don't add up"
del tmp_df
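Since the question asks for a stratified split, the pure pandas approach can be adapted by sampling within each label group. This is only a sketch under a few assumptions: the function name stratified_split_pandas is hypothetical, it relies on DataFrameGroupBy.sample (available in pandas 1.1 and later), and it assumes the DataFrame index is unique, because rows are removed with drop() by index label.

```python
import pandas as pd


def stratified_split_pandas(df, label_col, frac_train=0.7, frac_val=0.2, random_state=None):
    """Split df into train/val/test, sampling each label group separately.

    Hypothetical helper for illustration only; assumes a unique index.
    """
    # Draw the training rows from every label group in the same proportion.
    train_df = df.groupby(label_col).sample(frac=frac_train, random_state=random_state)
    rest_df = df.drop(train_df.index)
    # The validation share is rescaled relative to the rows that remain.
    rel_val = frac_val / (1.0 - frac_train)
    val_df = rest_df.groupby(label_col).sample(frac=rel_val, random_state=random_state)
    # Whatever is left over becomes the test set.
    test_df = rest_df.drop(val_df.index)
    return train_df, val_df, test_df


df = pd.DataFrame({
    "A": range(100),
    "label": ["foo"] * 70 + ["bar"] * 20 + ["baz"] * 10,
})
train_df, val_df, test_df = stratified_split_pandas(df, "label", random_state=0)

assert len(df) == len(train_df) + len(val_df) + len(test_df), "Dataset sizes don't add up"
print(train_df["label"].value_counts().to_dict())
print(val_df["label"].value_counts().to_dict())
print(test_df["label"].value_counts().to_dict())
```

Group sizes that do not divide evenly are rounded per group, so the resulting proportions are approximate for small or unbalanced classes.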
