Am I using OneHotEncoder correctly in my machine learning model?

Posted 2024-06-02 06:39:03


Using the Sklearn package in Python, my RandomForestRegressor model has 14 features and 1 label. I have 10,000 data points in each column, so the array shapes are Feature: (10000, 14) and Label: (10000, 1).

13 of the 14 features are in string format, so I used OneHotEncoder from sklearn.preprocessing on those 13 string features as shown below (the remaining feature is a float). Below I show only one feature as an example:

from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

values = array(df['receiver_bic'])  # This is one of the features, a BIC code for banks like "HANDSESS", in string format with a limited set of values

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary encode
onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
receiver_bic_onehot = onehot_encoder.fit_transform(integer_encoded)

The shape of the final array receiver_bic_onehot: (10000, 622)
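For comparison, here is a minimal sketch of the same encoding done in a single step; OneHotEncoder in scikit-learn 0.20+ accepts string columns directly, so the LabelEncoder step is optional (this assumes the same df as above):

# Sketch: encode the string column directly, without LabelEncoder
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder(sparse=False, categories='auto')  # the argument is named sparse_output in scikit-learn 1.2+
receiver_bic_onehot = onehot_encoder.fit_transform(df[['receiver_bic']])  # 2-D input (10000, 1) -> output (10000, 622)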

After applying the same process to each of the 13 string features, I got one-hot encoded features with the following shapes:

# Shapes of 13 OneHot_encoded features
(10000, 622), (10000, 397), (10000, 325), (10000, 331), (10000, 319), (10000, 235), (10000, 24), (10000, 4), (10000, 196), (10000, 78), (10000, 118), (10000, 128), (10000, 55)
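As a quick sanity check, these one-hot widths plus the single numeric column add up to the 2833 columns reported further down:

# Sum of the 13 one-hot widths plus the one numeric feature
widths = [622, 397, 325, 331, 319, 235, 24, 4, 196, 78, 118, 128, 55]
print(sum(widths) + 1)  # 2833, matching the training matrix width below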

Finally, I collected the features into X:

import numpy as np

X = np.c_[OneHot_Feature_1, OneHot_Feature_2, ... , OneHot_Feature_13, Numeric_Feature_14]

y = df[target_col] # Target column

X = np.array(X) # Converting Feature and Target to numpy arrays
y = np.array(y)

from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

The resulting array shapes are shown below:

Training Features Shape: (7000, 2833)
Training Labels Shape: (7000, 1)
Testing Features Shape: (3000, 2833)
Testing Labels Shape: (3000, 1)

Before using the features X in the model, I scale them with StandardScaler():

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Finally, I feed these arrays into the RandomForestRegressor model:

from sklearn.ensemble import RandomForestRegressor

est_RFR = RandomForestRegressor(n_estimators=10)
est_RFR = est_RFR.fit(X_train, y_train.ravel())  # ravel() converts the (n, 1) shape into (n,)
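For completeness, a minimal sketch of how the fitted model could then be evaluated on the held-out split (predict and score are standard RandomForestRegressor methods; the variable names match the snippets above):

# Predict on the test split and report the R^2 score
y_pred = est_RFR.predict(X_test)
print("R^2 on test set:", est_RFR.score(X_test, y_test.ravel()))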

My questions are:

  • Is my process above of applying OneHotEncoder to multiple features correct?

  • Even if it is correct, X.shape before OneHotEncoder is (10000, 14) and after OneHotEncoder it is (10000, 2833). My intuition tells me that I have an unnecessarily large number of feature columns in the model, using 2833 columns instead of 14. Is there a more suitable way to do this?

  • I tried to convert the one-hot encoded values back to their original values with inverted = label_encoder.inverse_transform([argmax(receiver_bic_onehot[:, :])]). But print(inverted) outputs only a single original value instead of the whole column. How should I write this code? (See the sketch after this list.)
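
For reference, a minimal sketch of the row-wise inverse mapping, assuming label_encoder and receiver_bic_onehot are the objects from the snippet above; argmax over axis=1 recovers one integer code per row, which inverse_transform then maps back to the original strings:

import numpy as np

# One integer code per row instead of a single value for the whole array
integer_back = np.argmax(receiver_bic_onehot, axis=1)         # shape (10000,)
strings_back = label_encoder.inverse_transform(integer_back)  # shape (10000,), e.g. "HANDSESS"
print(strings_back[:5])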


Tags: model, test, encoder, transform, train, integer, feature, array