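ValueError: could not convert string to float: 'Curtis RIngraham Directge'
I am working on data splitting and cross-validation. For the split, I only need to set aside the test set and keep the rest of the data unchanged so that I can run cross-validation on it. At the end of cross-validation, however, I get this error: ValueError: could not convert string to float: 'Curtis RIngraham Directge'. How can I fix it?
Data splitting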
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract the test data and store it in x_test, y_test
features = features_df.to_numpy()
labels = labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# Set k = 5
k = 5
kfold_splitter = KFold(n_splits=k)
folds_data = []  # storing every fold is memory-inefficient, but simple
fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train, x_valid = _x[train_index, :], _x[validation_index, :]
    y_train, y_valid = _y[train_index, :], _y[validation_index, :]
    print(f"Fold {fold} training data shape = {(x_train.shape, y_train.shape)}")
    print(f"Fold {fold} validation data shape = {(x_valid.shape, y_valid.shape)}")
    fold += 1
    folds_data.append((x_train, y_train, x_valid, y_valid))
Cross-validation
from sklearn.metrics import accuracy_score

best_validation_accuracy = 0
best_model_name = ""
best_model = None
# Iterate over all models
for model_name, model in all_models.items():
    print(f"Evaluating {model_name} ...")
    # Store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    # Iterate over all folds
    for x_train, y_train, x_valid, y_valid in folds_data:
        # Train the model
        model.fit(x_train, y_train.flatten())
        # Predict on the training data
        y_pred_train = model.predict(x_train)
        # Predict on the validation data
        y_pred_valid = model.predict(x_valid)
        # Compute and store the training accuracy for this fold
        train_acc = accuracy_score(y_train.flatten(), y_pred_train)
        train_acc_for_all_folds.append(train_acc)
        # Compute and store the validation accuracy for this fold
        valid_acc = accuracy_score(y_valid.flatten(), y_pred_valid)
        valid_acc_for_all_folds.append(valid_acc)
    # Average training accuracy across the k folds
    avg_training_acc = sum(train_acc_for_all_folds) / k
    print(f"Average training accuracy for model {model_name} = {avg_training_acc}")
    # Average validation accuracy across the k folds
    avg_validation_acc = sum(valid_acc_for_all_folds) / k
    print(f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    # Select the best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print("-----------------------------------")
print(f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")
I tried checking whether any string values remain in x_train, y_train, x_valid and y_valid, but I did not find any.
1 Answer
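This is probably because some columns in your dataset contain categorical data. As a first step, you can convert that categorical data into numbers using Method 1 or Method 2:
Method 1: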
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categories = ["col1", "col2", "col3", "col4"]  # columns which have categorical values
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categories)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)  # X is your feature DataFrame
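To make Method 1 concrete, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not from the question's data. `sparse_threshold=0` is passed so that `ColumnTransformer` returns a dense NumPy array, and `handle_unknown="ignore"` is an optional choice that keeps the encoder from failing on categories it has not seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "name" and "city" hold strings, "age" is numeric.
X = pd.DataFrame({
    "name": ["Curtis", "Dana", "Curtis", "Eli"],
    "city": ["Paris", "Oslo", "Oslo", "Paris"],
    "age": [31, 45, 27, 52],
})

# Columns that hold categorical (string) values
categorical_columns = ["name", "city"]

transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), categorical_columns)],
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

transformed_X = transformer.fit_transform(X)

# Every string column is now a set of 0/1 columns, so the array is numeric.
print(transformed_X.shape)
print(transformed_X.dtype)
```

The resulting array can be passed to `train_test_split` and `KFold` in place of `features_df.to_numpy()`.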
Method 2: You can convert the values of a specific column into integers.
import pandas as pd

# Replace 'col1' with integer dummy columns (the first category is dropped)
df = pd.get_dummies(df, columns=['col1'], drop_first=True, dtype=int)
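If it is not obvious which columns still hold strings, one way to find them is to inspect the DataFrame's dtypes before calling `to_numpy()`. The sketch below assumes a small stand-in for `features_df` with made-up columns; it lists the non-numeric columns, encodes them with `pd.get_dummies`, and then converts to a float array, which is the conversion that fails while string values are still present.

```python
import pandas as pd

# Hypothetical stand-in for features_df; the real columns will differ.
features_df = pd.DataFrame({
    "customer": ["Curtis", "Dana", "Eli"],
    "amount": [10.5, 3.0, 7.25],
})

# Columns whose dtype is not numeric (for example, strings)
string_columns = features_df.select_dtypes(exclude="number").columns.tolist()
print(string_columns)

# One-hot encode only those columns; numeric columns are left untouched
encoded_df = pd.get_dummies(
    features_df, columns=string_columns, drop_first=True, dtype=int
)

# This conversion raises ValueError if any string value is left
features = encoded_df.to_numpy(dtype=float)
print(features.shape)
```

Checking `features_df.dtypes` this way is usually more reliable than scanning the fold arrays by eye, because a single string column makes the whole NumPy array `object`-typed.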