ValueError: could not convert string to float: 'Curtis RIngraham Directge'

-2 votes
1 answer
20 views
Asked 2025-04-14 17:17

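I am working on data splitting and cross-validation. For the split, I need to set aside only the test set and leave the rest of the data untouched so that it can be used for cross-validation. At the end of cross-validation, however, I get this error: ValueError: could not convert string to float: 'Curtis RIngraham Directge'. How can I fix this?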

Data splitting

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract our test data and store it in x_test, y_test
features = features_df.to_numpy()
labels = labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_splitter = KFold(n_splits=k)

folds_data = []  # inefficient, but keeps every fold in memory for reuse below

fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train, x_valid = _x[train_index, :], _x[validation_index, :]
    y_train, y_valid = _y[train_index, :], _y[validation_index, :]
    print(f"Fold {fold} training data shape = {(x_train.shape, y_train.shape)}")
    print(f"Fold {fold} validation data shape = {(x_valid.shape, y_valid.shape)}")
    fold += 1
    folds_data.append((x_train, y_train, x_valid, y_valid))

Cross-validation

from sklearn.metrics import accuracy_score

best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():

    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]

    # Let's store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []

    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)

        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)

        # Compute training accuracy (labels flattened to 1-D, as for validation)
        train_acc = accuracy_score(y_pred_train , y_train.flatten())

        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)

        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)

    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k

    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")

    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k

    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")

    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

I tried to check whether any string values were left in x_train, y_train, x_valid and y_valid, but I did not find any.

1 Answer

0

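This is probably because some columns in your dataset contain categorical data. As a first step, you can convert that categorical data to numbers using either Method 1 or Method 2: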
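Before encoding anything, it may help to confirm which columns are actually non-numeric. The sketch below is only an illustration: the small `features_df` and its column names are made up, standing in for the real DataFrame from the question. Checking the DataFrame rather than the arrays is usually easier, because calling `to_numpy()` on a frame with mixed column types generally yields a single object-typed array in which the strings are harder to spot.

```python
import pandas as pd

# Hypothetical stand-in for the asker's features_df; the column names are invented.
features_df = pd.DataFrame({
    "director": ["Curtis RIngraham Directge", "Jane Doe", "Jane Doe"],
    "budget": [1.5, 2.0, 3.25],
    "year": [1999, 2005, 2010],
})

# Every column that is not numeric is a candidate for the
# "could not convert string to float" error when the model is fitted.
non_numeric_cols = features_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Any column listed here has to be encoded (or dropped) before the data reaches `model.fit`.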

Method 1:

#Turn the categories into numbers

from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer

categories = ["col1", "col2", "col3", "col4"]  # columns that hold categorical values

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categories)],
                                   remainder="passthrough")

transformed_X = transformer.fit_transform(X)
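For reference, here is a self-contained version of the same idea on made-up data. The DataFrame `X` and the column names `col1` and `num1` are placeholders, not the asker's real columns. The only addition is `handle_unknown="ignore"`, which may matter for cross-validation: with free-text values such as names, a validation fold can easily contain a category the encoder never saw during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column and one numeric column.
X = pd.DataFrame({
    "col1": ["a", "b", "a", "c"],
    "num1": [1.0, 2.0, 3.0, 4.0],
})
categories = ["col1"]  # columns that hold categorical values

# One-hot encode the categorical columns and pass the numeric ones through unchanged.
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), categories)],
    remainder="passthrough",
)

transformed_X = transformer.fit_transform(X)
print(transformed_X.shape)
```

The three distinct values in `col1` become three indicator columns, and `num1` is appended after them.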

Method 2: you can convert the values of a specific column to integers.

import warnings

warnings.filterwarnings('ignore')

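import pandas as pd

# get_dummies returns a DataFrame, so replace col1 with its dummy columns
# instead of assigning the result back into df['col1']
df = pd.get_dummies(df, columns=['col1'], drop_first=True, dtype=int)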
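Either method can also be wired straight into cross-validation, so that the encoder is fitted on the training part of each fold only. The following is a rough sketch under several assumptions: the toy data, the column names `director` and `budget`, and the choice of `LogisticRegression` are all placeholders for the asker's real data and `all_models`. Note that the DataFrame itself is passed in, not the result of `to_numpy()`, so that columns can be selected by name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for features_df / labels_df.
features_df = pd.DataFrame({
    "director": ["a", "b", "a", "c", "b", "a", "c", "b", "a", "c"],
    "budget": [1.0, 2.0, 1.5, 3.0, 2.5, 1.2, 3.3, 2.1, 0.9, 2.8],
})
labels = pd.Series([0, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Encode the string column; leave numeric columns untouched.
preprocess = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["director"])],
    remainder="passthrough",
)

# The pipeline refits the encoder inside every fold, then trains the model.
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

scores = cross_val_score(
    model, features_df, labels, cv=KFold(n_splits=5), scoring="accuracy"
)
print(scores.mean())
```

With this layout the manual `folds_data` loop is optional: `cross_val_score` does the splitting, fitting and scoring for each fold, and each candidate model only needs to be wrapped in the same pipeline.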
