Why does over-sampling in a pipeline explode the number of model coefficients?

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define preprocessor: scale the numeric columns, one-hot encode the categorical ones
preprocess = make_column_transformer(
    (StandardScaler(), ['attr1', 'attr2', 'attr3', 'attr4', 'attr5', 'attr6', 'attr7', 'attr8', 'attr9']),
    (OneHotEncoder(categories='auto'), ['attrcat1', 'attrcat2'])
)

# define train and test datasets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=0)

# don't do over-sampling in this case
os_X_train = X_train
os_y_train = y_train

print('Training data is type %s and shape %s' % (type(os_X_train), os_X_train.shape))
logreg = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model = make_pipeline(preprocess, logreg)
model.fit(os_X_train, np.ravel(os_y_train))
# wrap the shape in a 1-tuple so the whole tuple is formatted by a single %s
print("The coefficients shape is: %s" % (logreg.coef_.shape,))
print("Model coefficients: ", logreg.intercept_, logreg.coef_)
print("Logistic Regression score: %f" % model.score(X_test, y_test))

Training data is type <class 'pandas.core.frame.DataFrame'> and shape (87145, 11)
The coefficients shape is: (1, 47)
Model coefficients: [-7.51822124] [[ 0.10011794 0.10313989 ... -0.14138371 0.01612046 0.12064405]]
Logistic Regression score: 0.999116

from imblearn.over_sampling import SMOTE

# balance the classes by oversampling the training data
# (fit_resample is the current method name; older imbalanced-learn releases called it fit_sample)
os = SMOTE(random_state=0)
os_X_train, os_y_train = os.fit_resample(X_train, np.ravel(y_train))
os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
os_y_train = pd.DataFrame(data=os_y_train, columns=['response'])

Training data is type <class 'pandas.core.frame.DataFrame'> and shape (174146, 11)
The coefficients shape is: (1, 153024)
Model coefficients: [12.02830778] [[ 0.42926969 0.14192505 -1.89354062 ... 0.008847 0.00884372 -8.15123962]]
Logistic Regression score: 0.997938

1 Answer

User

#1 · Posted 2024-04-25 21:40:42

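OK, I found the culprit. SMOTE converts all feature columns to float, including the two categorical ones, and its synthetic samples are interpolated between existing rows, so those columns end up holding a huge number of distinct float values. When the column transformer applies OneHotEncoder to those float columns, it creates one output column per distinct value, which pushes the column count toward the number of samples.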
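The effect can be reproduced on a toy column. The sketch below is an illustration only: the data is made up, and the interpolation is a simplified stand-in for SMOTE (real SMOTE interpolates toward nearest neighbours across all features), but the consequence for OneHotEncoder is the same.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical integer-coded categorical column with 5 levels (0..4).
original = rng.integers(0, 5, size=(200, 1))

# Simplified stand-in for SMOTE: every synthetic value lies somewhere on the
# segment between two randomly chosen original values.
a = original[rng.integers(0, 200, size=300)].astype(float)
b = original[rng.integers(0, 200, size=300)].astype(float)
synthetic = a + rng.random((300, 1)) * (b - a)

# Stack originals and synthetic rows, as an over-sampler would return them.
oversampled = np.vstack([original.astype(float), synthetic])

# Count the one-hot columns produced before and after "over-sampling".
n_before = OneHotEncoder().fit_transform(original).shape[1]
n_after = OneHotEncoder().fit_transform(oversampled).shape[1]

print("one-hot columns before:", n_before)
print("one-hot columns after: ", n_after)
```

Almost every interpolated value is a new fractional number, so the encoder sees it as a new category.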

The solution is simply to cast those categorical columns back to int before running the pipeline:

# balance the classes by over-sampling the training data
os = SMOTE(random_state=0)
# fit_resample is the current method name; older imbalanced-learn releases called it fit_sample
os_X_train, os_y_train = os.fit_resample(X_train, np.ravel(y_train))
os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
# critically important: cast the categorical variables from float back to int
os_X_train['attrcat1'] = os_X_train['attrcat1'].astype(int)
os_X_train['attrcat2'] = os_X_train['attrcat2'].astype(int)
